Abstract

We consider the problem of estimating the regression function in functional linear regression models and propose a new type of projection estimator that combines dimension reduction and thresholding. The threshold rule yields consistency under broad assumptions, as well as minimax rates of convergence under additional regularity hypotheses. We also consider the particular case of Sobolev spaces generated by the trigonometric basis, which readily gives the mean squared error of prediction as well as estimators of the derivatives of the regression function. We prove that these estimators are minimax, and rates of convergence are given for some particular cases.
1 Introduction

Functional data analysis (Ramsay and Silverman (2005), Ferraty and Vieu (2006)) is a topic of growing interest in statistics, and many applications in chemometrics (Frank and Friedman (1993)), finance (Preda and Saporta (2005)), biometry or climatology (Besse et al. (2000)) now deal with the functional linear model. This model is useful to estimate or predict a scalar random variable, say $Y \in \mathbb{R}$, by means of a random function denoted by $X$. We assume in the following that $Y$ and $X$ are centered random variables and, without loss of generality, that the random function $X$ takes values in $L^2[0,1]$, the space of square integrable functions defined on $[0,1]$, endowed with its usual inner product $\langle f, g\rangle = \int_0^1 f(t)g(t)\,dt$ and associated norm $\|f\| = \langle f, f\rangle^{1/2}$, $f, g \in L^2[0,1]$. The functional linear model is then defined by

$$Y = \int_0^1 \beta(t)X(t)\,dt + \sigma\varepsilon, \qquad \sigma > 0, \tag{1.1}$$

where the function $\beta$ is called the regression or slope function and the error term $\varepsilon$ is supposed to be centered, $E(\varepsilon) = 0$, and not correlated with $X$: for all $t \in [0,1]$, $E(X(t)\varepsilon) = 0$.
Assuming that $X$ has a finite second moment, i.e. $E\|X\|^2 = \int_0^1 E|X(t)|^2\,dt < \infty$, one can define the covariance operator of $X$, say $\Gamma$. This operator is defined on $L^2[0,1]$ as follows: for any function $f \in L^2[0,1]$,

$$[\Gamma f](s) := \int_0^1 \operatorname{cov}(X(t), X(s))\, f(t)\,dt, \qquad \forall s \in [0,1]. \tag{1.2}$$

It is well known (see e.g. Cardot et al. (1999)) that the regression function $\beta$ satisfies the following moment equation

$$g(s) := E[Y X(s)] = [\Gamma\beta](s), \qquad s \in [0,1], \tag{1.3}$$

where $g$ belongs to $L^2[0,1]$. Since $\Gamma$ is a non-negative nuclear operator (Dauxois et al. (1982)), a continuous generalized inverse of $\Gamma$ does not exist as long as the range of the operator $\Gamma$ is an infinite dimensional subspace of $L^2[0,1]$. Consequently, inverting equation (1.3) to recover $\beta$ can be seen as an ill-posed inverse problem. Cardot et al. (2003) provide a necessary and sufficient condition for the existence of a unique solution of equation (1.3).

Assumption 1.1. The covariance operator $\Gamma$ of the random function $X$ is injective and the function $g = E[YX]$ belongs to the range $\mathcal{R}(\Gamma)$ of $\Gamma$.

Under this assumption, the covariance operator $\Gamma$ admits a discrete spectral decomposition given by a sequence $(\lambda_j)_{j\in\mathbb{N}}$ of strictly positive eigenvalues and a sequence of corresponding orthonormal eigenfunctions $\{\phi_j\}_{j\in\mathbb{N}}$. Then, the normal equation (1.3) can be rewritten as follows

$$\beta = \sum_{j\in\mathbb{N}} \frac{g_j}{\lambda_j}\,\phi_j \quad\text{with}\quad g_j := \langle g, \phi_j\rangle,\ j \in \mathbb{N}. \tag{1.4}$$

It is well known that, even in the case of a priori known eigenvalues $\{\lambda_j\}$ and eigenfunctions $\{\phi_j\}$, replacing in (1.4) the unknown function $g$ by a consistent estimator $\hat g$ does not in general lead to a consistent estimator of $\beta$. To be more precise, since the sequence $(\lambda_j)_{j\in\mathbb{N}}$ tends to zero, $E\|\hat g - g\|^2 = o(1)$ does not generally imply $\sum_{j\in\mathbb{N}} \lambda_j^{-2}\, E|\langle \hat g - g, \phi_j\rangle|^2 = o(1)$. Consequently, the estimation problem in the functional linear model is called ill-posed, and additional regularity assumptions on the regression function $\beta$ are necessary in order to obtain a uniform rate of convergence (c.f. Engl et al. (2000)).

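The objective is to estimate the regression function $\beta$, as well as its derivatives, when observing a sample $(Y_i, X_i)_{1\le i\le n}$ of $n$ i.i.d. realizations of $(Y, X)$. We can define the empirical estimators of $g$ and $\Gamma$ respectively as follows

$$\hat g := \frac{1}{n}\sum_{i=1}^n Y_i X_i \quad\text{and}\quad \hat\Gamma := \frac{1}{n}\sum_{i=1}^n \langle X_i, \cdot\rangle X_i. \tag{1.5}$$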
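As an illustration of (1.5), the following minimal sketch computes the coordinates of the two empirical estimators with respect to an orthonormal system. It assumes that each curve has already been reduced to its first $m$ basis coefficients $\langle X_i, \psi_j\rangle$; the function name `empirical_estimators` and the argument layout are hypothetical and not part of the paper.

```python
def empirical_estimators(scores, responses):
    """Coordinates of the empirical estimators in (1.5).

    Assumption (not from the paper): scores[i][j] holds <X_i, psi_j>, the
    j-th basis coefficient of the i-th centered curve, and responses[i]
    holds the centered response Y_i.
    """
    n = len(scores)
    m = len(scores[0])
    # Coordinates of g_hat: (1/n) * sum_i Y_i <X_i, psi_j>
    g_hat = [sum(responses[i] * scores[i][j] for i in range(n)) / n
             for j in range(m)]
    # Matrix of Gamma_hat: (1/n) * sum_i <X_i, psi_j> <X_i, psi_l>
    gamma_hat = [[sum(scores[i][j] * scores[i][l] for i in range(n)) / n
                  for l in range(m)]
                 for j in range(m)]
    return g_hat, gamma_hat


# Two curves described by two basis coefficients each.
g_hat, gamma_hat = empirical_estimators([[1.0, 0.0], [0.0, 2.0]], [2.0, 4.0])
```

The matrix returned here is the empirical covariance matrix of the coefficient vectors, which is the only quantity the projection approach needs from the curves.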



The main class of estimation procedures studied in the statistical literature is based on principal components regression and consists in reducing the dimension by inverting equation (1.3) in the finite dimensional space generated by the eigenfunctions of $\hat\Gamma$ associated with the largest eigenvalues (see e.g. Bosq (2000), Frank and Friedman (1993), Cardot et al. (1999), Cardot et al. (2007), or Müller and Stadtmüller (2005) in the context of generalized linear models).

The second important class of estimators relies on minimizing a penalized least squares criterion, which can be seen as a generalization of ridge regression. Marx and Eilers (1999) and Cardot et al. (2003) proposed a B-spline expansion of the regression function with a penalty on the squared norm of a fixed order derivative of the estimator. More recently, Crambes et al. (2008) proposed a spline smoothing decomposition with the same type of penalty and proved the optimality of their estimators according to a criterion that can be interpreted as a squared error of prediction. Note that this question has recently given rise to numerous publications in the machine learning community with similar ideas based on reproducing kernel Hilbert spaces (RKHS) and Tikhonov regularization (see e.g. Smale and Zhou (2007), Bauer et al. (2007) and references therein).



Borrowing ideas from the inverse problems community (Efromovich and Koltchinskii (2001) and Hoffmann and Reiß (2008)), we propose in this article a new class of estimators which rely on dimension reduction, by projecting the data onto some basis of orthonormal functions, and on threshold techniques that allow us to control the accuracy of the estimator. More precisely, let us consider a set of orthonormal functions, such as a wavelet or trigonometric basis, denoted by $\{\psi_1, \dots, \psi_m, \dots\}$, which forms a basis of $L^2[0,1]$. Given a dimension $m \ge 1$, we denote by $[\hat\Gamma]_m$ the $m \times m$ matrix with generic elements $\langle \hat\Gamma\psi_\ell, \psi_j\rangle$, $j, \ell = 1, \dots, m$, and by $[\hat g]_m$ the $m$-vector with elements $\langle \hat g, \psi_\ell\rangle$, $\ell = 1, \dots, m$. We first remark that the least squares estimator of $\beta$ obtained with the projections of the $X_i$ onto the subspace of $L^2[0,1]$ spanned by the functions $\{\psi_1, \dots, \psi_m\}$ is simply given, when $[\hat\Gamma]_m$ is nonsingular, by $([\hat\Gamma]_m^{-1}[\hat g]_m)^t[\psi]_m(\cdot)$, where $[\psi]_m(\cdot) = (\psi_1(\cdot), \dots, \psi_m(\cdot))^t$. Our estimator, in its simplest form, consists in thresholding this projection estimator when, roughly speaking, the norm of the inverse of the matrix $[\hat\Gamma]_m$ is too large. More precisely, introducing a threshold value $\gamma$ which will depend on $m$ and $n$, we propose to estimate $\beta$ as follows

$$\hat\beta(t) = \sum_{\ell=1}^m \hat\beta_\ell \cdot \mathbb{1}\{\|[\hat\Gamma]_m^{-1}\| \le \gamma\} \cdot \psi_\ell(t), \qquad t \in [0,1], \tag{1.6}$$



where the $\hat\beta_\ell$ are the generic elements of the vector of coordinates obtained by least squares projection and $\mathbb{1}$ is the indicator function. This new thresholding step can be seen as an improvement of the estimator proposed by Ramsay and Dalzell (1991), which was built by projecting the data onto a finite dimensional basis of functions. From an inverse problems perspective this approach is similar to the linear Galerkin procedure (Natterer (1977) or Engl et al. (2000)), defined as follows: $\beta_m \in \Psi_m$, where $\Psi_m$ is the subspace spanned by $\{\psi_1, \dots, \psi_m\}$, denotes a Galerkin solution of the operator equation $g = \Gamma\beta$ when

$$\|g - \Gamma\beta_m\| \le \|g - \Gamma\tilde\beta\|, \qquad \forall \tilde\beta \in \Psi_m. \tag{1.7}$$

Since $\Gamma$ is strictly positive it follows that $\beta_m = [\beta_m]_m^t[\psi]_m(\cdot)$ with $[\beta_m]_m = [\Gamma]_m^{-1}[g]_m$ is the unique Galerkin solution satisfying $[\Gamma(\beta - \beta_m)]_m = 0$. Compared to principal components regression, it has the advantage that it does not require estimating the eigenfunctions of the empirical covariance operator.
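The thresholding rule in (1.6) can be sketched as follows. This is only an illustration under stated assumptions: the inputs are the matrix $[\hat\Gamma]_m$ and the vector $[\hat g]_m$ as nested Python lists, the matrix is symmetric, and the spectral norm of the inverse is approximated by power iteration, which is a convenience choice made here and not a technique taken from the paper. All function names are hypothetical.

```python
import math


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting; None if numerically singular."""
    n = len(matrix)
    aug = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def spectral_norm_symmetric(matrix, iterations=500):
    """Approximate the largest absolute eigenvalue of a symmetric matrix.

    Power iteration from a fixed start vector; it can fail if that vector
    happens to be orthogonal to the leading eigenvector.
    """
    n = len(matrix)
    vec = [float(i + 1) for i in range(n)]
    scale = math.sqrt(sum(x * x for x in vec))
    vec = [x / scale for x in vec]
    norm = 0.0
    for _ in range(iterations):
        new = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0.0:
            return 0.0
        vec = [x / norm for x in new]
    return norm


def thresholded_coefficients(gamma_hat, g_hat, threshold):
    """Least squares coordinates, set to zero when the inverse is too large."""
    inverse = invert(gamma_hat)
    if inverse is None or spectral_norm_symmetric(inverse) > threshold:
        return [0.0] * len(g_hat)
    m = len(g_hat)
    return [sum(inverse[i][j] * g_hat[j] for j in range(m)) for i in range(m)]


kept = thresholded_coefficients([[2.0, 0.0], [0.0, 0.5]], [2.0, 1.0], 3.0)
dropped = thresholded_coefficients([[2.0, 0.0], [0.0, 0.5]], [2.0, 1.0], 1.0)
```

In the example the inverse matrix has norm 2, so the coefficients are kept with threshold 3 and replaced by zero with threshold 1. The estimator itself is then the linear combination of the basis functions with these coefficients.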

We will consider a large class of weighted norms to evaluate the asymptotic rates of convergence of the thresholded projection estimators. For $f \in L^2[0,1]$, we define

$$\|f\|_\omega^2 := \sum_{j=1}^\infty \omega_j\,|\langle f, \psi_j\rangle|^2 \tag{1.8}$$

for some strictly positive sequence of weights $(\omega_j)_{j\ge 1}$. Then, the performance of the estimator $\hat\beta$ of $\beta$ is evaluated according to the risk $E\|\hat\beta - \beta\|_\omega^2$, called $\mathcal{W}_\omega$-risk in the following, which is simply the $L^2[0,1]$-risk when $\omega_j = 1$ for all $j \in \mathbb{N}$. With appropriate choices of the weight sequence $\omega$, this general framework covers the estimation of derivatives of $\beta$ as well as the optimal estimation with respect to the mean squared prediction error. Indeed, the prediction error of a new value of $Y$ given any random function $X_{n+1}$ possessing the same distribution as $X$ and being independent of $X_1, \dots, X_n$ can be evaluated as follows (see, for example, Cardot et al. or Crambes et al. (2008) for similar setups)

$$E\Bigl(\Bigl|\int_0^1 \hat\beta(s)X_{n+1}(s)\,ds - \int_0^1 \beta(s)X_{n+1}(s)\,ds\Bigr|^2 \,\Big|\, \hat\beta\Bigr) = \langle \Gamma(\hat\beta - \beta), (\hat\beta - \beta)\rangle.$$

Consequently, if we suppose, for the sake of simplicity, that the functions $\psi_j$ are also the eigenfunctions $\phi_j$ of the operator $\Gamma$, then it is clear that choosing $\omega_j = \lambda_j$ amounts to evaluating, according to the $\omega$-norm, the mean squared prediction error of the estimator.
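To make the role of the weights concrete, here is a small sketch of the weighted squared norm applied to a vector of basis coefficients. The coefficient values and the eigenvalues below are made up for illustration, and the function name `weighted_sq_norm` is hypothetical.

```python
def weighted_sq_norm(coefficients, weights):
    """Sum of weights[j] * coefficients[j]**2 over the available coordinates."""
    return sum(w * c * c for c, w in zip(coefficients, weights))


# Coordinates of an error function (estimator minus slope) in the basis
# (illustrative numbers only).
error = [1.0, 2.0]

# Weights equal to one: plain squared L2 norm of the error.
l2_error = weighted_sq_norm(error, [1.0, 1.0])

# Weights equal to (made-up) eigenvalues of the covariance operator: when the
# basis consists of its eigenfunctions this is the quadratic form
# <Gamma e, e>, i.e. the prediction error discussed above.
prediction_error = weighted_sq_norm(error, [1.0, 0.25])
```

Decreasing weights, as in the prediction error, downweight the high-order coordinates, whereas increasing weights, as in a Sobolev norm, emphasize them.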

The paper is organized as follows. In Section 2, we fix notations and first derive consistency of the estimator in the general case under broad moment assumptions. We then prove minimax results under some additional regularity assumptions based on a link condition between the operator $\Gamma$ and the basis $\{\psi_j\}$. Section 3 is devoted to the particular case of the trigonometric basis and focuses on finitely and infinitely smoothing operators $\Gamma$ as well as different regularity conditions for the function $\beta$. We first consider the case of the mean squared prediction error and get asymptotic rates of convergence which are comparable to those of Crambes et al. (2008) in the polynomial case. One remarkable result is that in the exponential case one can attain the parametric rate up to a power of a $\log n$ factor. Rates of convergence for the function itself and its derivatives are also given. They are similar to those obtained by Hall and Horowitz (2007) in the case of the estimation of the function itself. Finally, a brief Section 4 presents concluding remarks and some perspectives. The proofs are gathered in the Appendix.



2 Asymptotic properties, the general case 
2.1 Notations and assumptions. 

We assume from now on that the regression function $\beta$ belongs to some ellipsoid $\mathcal{W}_b^\rho$, $\rho > 0$, defined as follows

$$\mathcal{W}_b^\rho := \Bigl\{ f \in L^2[0,1] : \sum_{j=1}^\infty b_j\,|\langle f, \psi_j\rangle|^2 =: \|f\|_b^2 \le \rho \Bigr\}, \tag{2.1}$$

where $\{\psi_j, j \in \mathbb{N}\}$ is as before some orthonormal basis in $L^2[0,1]$, not necessarily corresponding to the eigenfunctions of $\Gamma$, and the sequence of weights $(b_j)_{j\ge 1}$ is non-decreasing. Here $\mathcal{W}_b^\rho$ captures all the prior information (such as the smoothness) about the unknown slope function $\beta$.

Matrix and operator notations. Given $m \ge 1$, $\Psi_m$ denotes the subspace of $L^2[0,1]$ spanned by the functions $\{\psi_1, \dots, \psi_m\}$. $\Pi_m$ and $\Pi_m^\perp$ denote the orthogonal projections on $\Psi_m$ and its orthogonal complement respectively. Given an operator (matrix) $K$, $\|K\|_\omega$ denotes its operator $\mathcal{W}_\omega$-norm, i.e. $\|K\|_\omega := \sup_{\|f\|_\omega = 1}\|Kf\|_\omega$. The inverse operator (matrix) of $K$ is denoted by $K^{-1}$, the adjoint (transposed) operator (matrix) of $K$ by $K^t$. The identity operator (matrix) is denoted by $I$. For a vector $v$ and a matrix $K$, the upper $m$ subvector and $m \times m$ sub-matrix are denoted by $[v]_m$ and $[K]_m$, and their entries by $v_i$ and $K_{i,j}$ respectively. The diagonal matrix with entries $v$ is denoted by $\mathrm{Diag}(v)$. $[f]$ and $[K]$ denote the (infinite) vector and matrix of the function $f$ and the operator $K$ with the entries $[f]_j = \langle f, \psi_j\rangle$ and $[K]_{j,l} = \langle K\psi_l, \psi_j\rangle$ respectively. Clearly, $[\Pi_m f]_m = [f]_m$, and if we restrict $\Pi_m K \Pi_m$ to an operator from $\Psi_m$ into itself, then it has the matrix $[K]_m$. Moreover, $\Pi_m f = [f]_m^t[\psi]_m(\cdot)$ and $\Pi_m K \Pi_m f = [f]_m^t[K]_m^t[\psi]_m(\cdot)$ with $[\psi]_m(\cdot) = (\psi_1(\cdot), \dots, \psi_m(\cdot))^t$.

Consider the covariance operator $\Gamma$. We assume throughout the paper that $\Gamma$ is strictly positive definite and hence the matrix $[\Gamma]_m$ is nonsingular for all $m \in \mathbb{N}$, so that $[\Gamma]_m^{-1}$ always exists. Under this assumption the notation $\Gamma_m^{-1}$ is used for the operator from $L^2[0,1]$ into itself whose matrix in the basis $\{\psi_j\}$ has the entries $([\Gamma]_m^{-1})_{i,j}$ for $1 \le i, j \le m$ and zeros otherwise.



Moment assumptions. The results derived below involve additional conditions on the moments of the random function $X$, which we formalize now. Here and subsequently, we denote by $\mathcal{X}$ the set of all centered random functions $X$ with finite second moment, i.e., $E\|X\|^2 < \infty$, and strictly positive covariance operator. Given $X \in \mathcal{X}$ consider the random vector $[X]_m$; its entries $[X]_j = \langle X, \psi_j\rangle$ have mean zero and variance $[\Gamma]_{j,j} = \langle \Gamma\psi_j, \psi_j\rangle$, but they are not uncorrelated. In fact, $[\Gamma]_m$ is the covariance matrix of $[X]_m$. Since $\Gamma$ is strictly positive definite it follows that $[\Gamma]_m$ is nonsingular. Therefore, the random vector $[\Gamma]_m^{-1/2}[X]_m$ has mean zero and the identity $I_m$ as covariance matrix. Then we denote by $\mathcal{X}_\eta^k$, $k \in \mathbb{N}$, $\eta \ge 1$, the subset of $\mathcal{X}$ containing only random functions $X$ with uniformly bounded $k$-th moment of the corresponding random variables $[X]_j/[\Gamma]_{j,j}^{1/2}$, $j \in \mathbb{N}$, and $([\Gamma]_m^{-1/2}[X]_m)_j$, $1 \le j \le m$, $m \in \mathbb{N}$, that is

$$\mathcal{X}_\eta^k := \Bigl\{ X \in \mathcal{X} \text{ such that } \sup_{j\in\mathbb{N}} E\bigl|[X]_j/[\Gamma]_{j,j}^{1/2}\bigr|^k \le \eta \text{ and } \sup_{m\in\mathbb{N}}\sup_{1\le j\le m} E\bigl|([\Gamma]_m^{-1/2}[X]_m)_j\bigr|^k \le \eta \Bigr\}. \tag{2.2}$$

It is worth noting that in case $X \in \mathcal{X}$ is a Gaussian random function the corresponding random variables $[X]_j/[\Gamma]_{j,j}^{1/2}$, $j \in \mathbb{N}$, and $([\Gamma]_m^{-1/2}[X]_m)_j$, $1 \le j \le m$, $m \in \mathbb{N}$, are Gaussian with mean zero and variance one. Hence, for each $k \in \mathbb{N}$ there exists $\eta$ such that any Gaussian random function $X \in \mathcal{X}$ belongs also to $\mathcal{X}_\eta^k$. Furthermore, in what follows, $\mathcal{E}_\eta^k$ stands for the set of all centered error terms $\varepsilon$ with variance one and finite $k$-th moment, i.e., $E|\varepsilon|^k \le \eta$.



2.2 Consistency. 

The $\mathcal{W}_\omega$-risk of $\hat\beta$ is essentially determined by the deviation of the estimators of $[g]_m$ and $[\Gamma]_m$, and by the regularization error due to the projection. The next assertion summarizes minimal conditions that ensure consistency of the estimator $\hat\beta$ proposed in (1.6).
Proposition 2.1. Assume an n-sample of (Y,X) satisfying flTTTJ with a > 0. Let (5 G W w , 
X G X^ and e G r\ ^ 1. Consider the estimator (3 with parameter m := m(n) and 
threshold 7 := 7(n) are chosen such that 7 ^ 2||[r]~ 1 || and suppose, as n — > 00, that 
1/m = o(l), 7 (m/n) sup 1 ^ J ^ m {u; : ,} = o(l), (m 2 /n) = o(l) and 7 2 (m 3 /re 1+1 / 2 ) = O(l). // 
in addition sup mgN ||r~ 1 n m rn^|| w < 00, then E||/3 — /3|| 2 = o(l) as n — > 00. 
Remark 2.1. The last result covers the case $\omega \equiv 1$, i.e., the estimator of $\beta$ is consistent without an additional assumption on $\beta$. However, consistency is only obtained under the condition $\sup_{m\in\mathbb{N}}\|\Gamma_m^{-1}\Pi_m\Gamma\Pi_m^\perp\|_\omega < \infty$, which is known to be sufficient to ensure convergence in the $\mathcal{W}_\omega$-norm as $m \to \infty$ of the Galerkin solution $\beta_m = [\beta_m]_m^t[\psi]_m(\cdot)$ with $[\beta_m]_m = [\Gamma]_m^{-1}[g]_m$ to the slope parameter $\beta$. Furthermore, if $\omega$ is increasing, as in the case of a Sobolev norm, then $\hat\beta$ is obviously a consistent estimator only if $\beta \in \mathcal{W}_\omega$. Moreover, in the last assertion we may replace the condition $\beta \in \mathcal{W}_\omega$ by the assumption that $\beta \in \mathcal{W}_b$ and $(\omega_j/b_j)$ is non-increasing. In this situation we have $\mathcal{W}_b \subset \mathcal{W}_\omega$ and thus the result still holds true. Roughly speaking, this corresponds to the condition that at least $p \ge s$ derivatives exist in case we want to estimate the $s$-th derivative. □



Link condition. In the last assertion the choice of the smoothing parameters $m$ and $\gamma$, i.e. $\gamma \ge 2\|[\Gamma]_m^{-1}\|$, depends on the relation between the covariance operator $\Gamma$ associated to the regressor $X$ and the basis $\{\psi_j\}$ used for the projection, which we formalize next. Consider the sequence $(\|\Gamma\psi_j\|)_{j\ge 1}$, which is summable and hence converges to zero since $\Gamma$ is nuclear. In what follows we impose restrictions on the decay of this sequence. Therefore, consider a strictly positive, monotonically decreasing and summable sequence of weights $v := (v_j)_{j\in\mathbb{N}}$ with $v_1 = 1$. Then for $s \in \mathbb{R}$ denote by $\|\cdot\|_{v^s}$ the associated weighted norm given by $\|f\|_{v^s}^2 := \sum_{j=1}^\infty v_j^s\,|\langle f, \psi_j\rangle|^2$. Let $\mathcal{N}$ be the set of all self-adjoint nuclear operators defined on $L^2[0,1]$. Then for $d \ge 1$ define the subset $\mathcal{N}_v^d$ of $\mathcal{N}$ by

$$\mathcal{N}_v^d := \Bigl\{ \Gamma \in \mathcal{N} : \|f\|_{v^2}^2/d^2 \le \|\Gamma f\|^2 \le d^2\,\|f\|_{v^2}^2, \quad \forall f \in L^2[0,1] \Bigr\}. \tag{2.3}$$

A similar condition, but in a different context, can be found, for example, in Nair et al. (2005) and Chen and Reiß (2008). Note that for all $\Gamma \in \mathcal{N}_v^d$, by using the inequality of Heinz (1951), it follows that $\|\Gamma^{1/2}\psi_j\|^2 \asymp_d v_j$. Hence, the sequence $(v_j)_{j\ge 1}$ has to be summable, i.e., $\sum_j v_j < \infty$, since $\Gamma$ is nuclear. We first consider this general class of operators. However, we illustrate condition (2.3) in Section 3 by considering the particular cases of a sequence $v$ with polynomial or exponential decay, which are naturally linked to polynomial or exponential decreasing rates for the eigenvalues of $\Gamma$. To be more precise, if the eigenvalue decomposition of $\Gamma \in \mathcal{N}$ is given by $\{\lambda_j, \psi_j, j \in \mathbb{N}\}$ then $\Gamma \in \mathcal{N}_v^d$ if and only if $\lambda_j \asymp_d v_j$ for all $j \in \mathbb{N}$. All the results below are derived under the following basic regularity assumption.

Assumption 2.1. Let $\omega := (\omega_j)_{j\ge 1}$, $b := (b_j)_{j\ge 1}$ and $v := (v_j)_{j\ge 1}$ be strictly positive sequences of weights with $\omega_1 = 1$, $b_1 = 1$ and $v_1 = 1$ such that $b$ and $(b_j/\omega_j)_{j\ge 1}$ are non-decreasing and $v$ and $(v_j^2/\omega_j)_{j\ge 1}$ are non-increasing with $\Lambda := \sum_{j=1}^\infty v_j < \infty$.

Note that under Assumption 2.1, i.e., when $(b_j/\omega_j)_{j\ge 1}$ is non-decreasing, the ellipsoid $\mathcal{W}_b^\rho$ is a subset of $\mathcal{W}_\omega^\rho$. Roughly speaking, if $\mathcal{W}_b^\rho$ describes $p$-times differentiable functions, then Assumption 2.1 ensures that the $\mathcal{W}_\omega$-risk involves at most $s \le p$ derivatives. On the other hand, if the sequence $\omega$ is decreasing, i.e., the $\mathcal{W}_\omega$-norm is, roughly speaking, smoothing, Assumption 2.1 excludes cases in which $\omega$ decreases faster than the sequence $v^2$. However, in case $\omega = v^2$ we show below that the obtainable optimal rate is parametric, and hence, whenever $(\omega_j/v_j^2) = o(1)$ it is parametric too.

The next assertion summarizes minimal conditions that ensure consistency of the estimator $\hat\beta$ given in (1.6) when the covariance operator satisfies a link condition.

Corollary 2.2. Assume an n-sample of (Y,X) satisfying (II. ip with a > and associated 
covariance operator F G N^, d ^ 1. Let (3 G W&, X G X* and e G r\ ^ 1. Consider 



¹ We write $a \asymp_d b$ if $d^{-1} \le b/a \le d$.
the estimator $\hat\beta$ with threshold $\gamma = 8d^3/v_m$ and parameter $m := m(n)$ chosen such that
1/m = o(l), (m/n) swp^^Uj/vj} = o(l), (m 2 /n) = o(l) and m 3 /(^n 1+1 / 2 ) = 0(1) 
as n — ► 00. // m addition Assumption \2. 1\ is satisfied, then E||/3 — /3|| 2 = o(l) as n — > 00. 

It is worth noting that the link condition $\Gamma \in \mathcal{N}_v^d$ used in the last assertion implies $\sup_{m\in\mathbb{N}}\|\Gamma_m^{-1}\Pi_m\Gamma\Pi_m^\perp\|_\omega < \infty$ and hence automatically ensures consistency in the $\mathcal{W}_\omega$-norm of the Galerkin solution $\beta_m$ as $m \to \infty$. However, in order to obtain a rate of convergence it is necessary to impose additional regularity assumptions on the slope parameter $\beta$. First we derive a lower bound for any estimator when these regularity assumptions are formalized by the condition that $\beta$ belongs to the ellipsoid $\mathcal{W}_b^\rho$.



2.3 The lower bound. 

It is well known that in general the hardest one-dimensional subproblem does not capture the full difficulty in estimating the solution of an inverse problem, even in the case of a known operator (for details see e.g. the proof in Mair and Ruymgaart (1996)). In other words, there do not exist two sequences of slope functions $\beta_{1,n}, \beta_{2,n} \in \mathcal{W}_b^\rho$ which are statistically not consistently distinguishable and which satisfy $\|\beta_{1,n} - \beta_{2,n}\|_\omega^2 \ge C\delta_n^*$, where $\delta_n^*$ is the optimal rate of convergence. Therefore we need to consider subsets of $\mathcal{W}_b^\rho$ with a growing number of elements in order to get the optimal lower bound. More precisely, we obtain the following lower bound by applying Assouad's cube technique (see e.g. Korostelev and Tsybakov (1993) or Chen and Reiß (2008)). Moreover, the following lower bound is obtained under the additional assumption that the distribution of the error term $\varepsilon$ is Gaussian with mean zero and variance one, i.e., $\varepsilon \sim \mathcal{N}(0,1)$.

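Theorem 2.3. Assume an $n$-sample of $(Y, X)$ satisfying (1.1) with $\sigma > 0$ and associated covariance operator $\Gamma \in \mathcal{N}_v^d$, $d \ge 1$. Suppose the error term $\varepsilon \sim \mathcal{N}(0,1)$ is independent of $X$. Consider $\mathcal{W}_b^\rho$, $\rho > 0$, as set of slope functions. Let $m_* := m_*(n)$ and $\delta_n^* := \delta_n^*(m_*)$ for some $\Delta > 1$ be chosen such that

$$\Delta^{-1} \le \frac{b_{m_*}}{n\,\omega_{m_*}} \sum_{j=1}^{m_*} \frac{\omega_j}{v_j} \le \Delta \quad\text{and}\quad \delta_n^* := \frac{\omega_{m_*}}{b_{m_*}}. \tag{2.4}$$

If in addition Assumption 2.1 is satisfied, then for any estimator $\tilde\beta$ of $\beta$ we have

$$\sup_{\beta\in\mathcal{W}_b^\rho} E\|\tilde\beta - \beta\|_\omega^2 \ge \frac{\delta_n^*}{4\Delta}\cdot\min\Bigl\{\frac{\sigma^2}{2d}, \frac{\rho}{\Delta}\Bigr\}.$$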

Remark 2.2. The normality and independence assumption on the error term in the last theorem is only used to simplify the calculation of the distance between distributions corresponding to different slope functions. However, below we show an upper bound for the estimator $\hat\beta$ in case the error term $\varepsilon$ and the regressor $X$ satisfy the moment conditions of Section 2.1 for some $k \in \mathbb{N}$ and $\eta \ge 1$ and are only uncorrelated, which includes the particular case of an independent Gaussian error considered in Theorem 2.3 as long as $\eta$ is sufficiently large. Therefore, by applying Theorem 2.3, an upper bound of order $\delta_n^*$ implies that this rate is optimal and hence the estimator $\hat\beta$ is minimax-optimal. Note further that if $(\omega_j/v_j)$ is summable then the order $\delta_n^*$ is parametric. This in particular is the case when $\omega = v^2$, since $(v_j)$ is summable. □
Remark 2.3. In case the eigenfunctions of the operator $\Gamma$ are known, the obtainable accuracy of any estimator of $\beta$ is essentially determined by the decay of the eigenvalues $(\lambda_j)_{j\ge 1}$ of $\Gamma$. To be more precise, if for some sequence of weights $v := (v_j)_{j\ge 1}$ we have

$$\exists\, d \ge 1: \quad \lambda_j \asymp_d v_j, \quad \forall j \in \mathbb{N}, \tag{2.5}$$

then $v$ determines the obtainable rate of convergence (c.f. Johannes (2008)). If $\{\psi_j\}$ are the eigenfunctions of $\Gamma$, i.e., $\lambda_j = \langle\Gamma\psi_j, \psi_j\rangle$, then the condition (2.5) holds if and only if $\Gamma \in \mathcal{N}_v^d$. In other words, the condition $\Gamma \in \mathcal{N}_v^d$ specifies in this situation the decay of the eigenvalues of $\Gamma$. However, the set $\mathcal{N}_v^d$ also contains operators whose eigenfunctions are not given by $\{\psi_j\}$. Then the corresponding eigenvalues may decay far slower than the sequence of weights $v$. Hence, for these operators the obtainable rate of convergence may be far slower when using the basis $\{\psi_j\}$ in place of their eigenfunctions. □

2.4 The upper bound. 

In the following theorem we provide an upper bound for the estimator $\hat\beta$ defined in (1.6) by assuming sequences $b$, $\omega$ and $v$ with the additional property that

-^ = 0(1), ^ sup {^} = 0(l)and-^- T = 0(l)forsomeA ;G Nasn^oo, 

(2.6) 

where $m_* := m_*(n)$ and $\delta_n^* := \delta_n^*(m_*)$ are given by (2.4). The next theorem states that the rate $\delta_n^*$ of the lower bound given in Theorem 2.3 also provides an upper bound for the estimator $\hat\beta$ defined in (1.6).

Theorem 2.4. Assume an n-sample of (Y, X) satisfying (|l.ip with a > and associated 
covariance operator T G J\f^, d ^ 1. Consider W^, p > as set of slope functions and 
suppose that the sequences b, uj and v satisfy the Assumption \2.l\ Let m* := m*(n) and 
8* n := o~n( n X ^ e 9^ ven by (|2.4p and suppose (12.60 is satisfied for some k ^ 4. Consider the 
estimator f3 with parameter m = m* and threshold 7 = n max(l, 8 <i 3 A/6 mt ). If in addition 
X G X^ k and e G £^ k , r) ^ 1, £/ien u>e /iai>e 

sup Ep-/3||E ^^d 16 AV + M}. 

/3ew b p 

where C is a positive constant. 

Thus, we have proved that the rate $\delta_n^*$ is optimal and hence the estimator $\hat\beta$ is minimax optimal.

Remark 2.4. It is worth noting that as long as the sequence $b$ is increasing, the condition on the threshold $\gamma$ given in Theorem 2.4 reads $\gamma = n$ for all sufficiently large $n$. Therefore, only the parameter $m$ has to be chosen in a data-driven way in order to build an adaptive estimation procedure. On the other hand, under the assumptions of Theorem 2.4 the parametric rate cannot be obtained. To be more precise, in case $\sum_j \omega_j/v_j < \infty$, the rate of the lower bound in Theorem 2.4 is given by $\delta_n^* = 1/n$. But in this case the condition in (2.6) involving $\sup_{1\le j\le m_*}\{\omega_j/v_j\}$ is not satisfied and hence we cannot apply Theorem 2.4. However, we conjecture that the proposed estimator also attains the parametric rate under a stronger set of assumptions as, for example, used by Johannes and Schenk (2008) in order to obtain rate optimal estimation of a linear functional of the slope parameter $\beta$. □






3 Mean squared prediction error and derivative estimation 



In this section we will suppose that the slope function $\beta$ is an element of the Sobolev space of periodic functions $\mathcal{W}_p$ for some $p > 0$, given by

$$\mathcal{W}_p = \bigl\{ f \in H_p : f^{(j)}(0) = f^{(j)}(1),\ j = 0, 1, \dots, p-1 \bigr\}, \tag{3.1}$$

where $H_p := \{ f \in L^2[0,1] : f^{(p-1)} \text{ absolutely continuous},\ f^{(p)} \in L^2[0,1] \}$ is a Sobolev space (c.f. Neubauer (1988a,b), Mair and Ruymgaart (1996) or Tsybakov (2004)). Let us first remark that if we consider the sequence of weights $(b_j^p)_{j\in\mathbb{N}}$ given by

$$b_1^p = 1 \quad\text{and}\quad b_{2j}^p = b_{2j+1}^p = j^{2p}, \quad j \in \mathbb{N},$$

and the trigonometric basis

$$\psi_1(t) = 1, \quad \psi_{2k}(t) = \sqrt{2}\cos(2\pi k t), \quad \psi_{2k+1}(t) = \sqrt{2}\sin(2\pi k t), \quad k = 1, 2, \dots, \tag{3.2}$$

then the Sobolev space of periodic functions is equivalently given by $\mathcal{W}_{b^p}$ defined in (2.1). Therefore, let us denote by $\mathcal{W}_p^\rho := \mathcal{W}_{b^p}^\rho$, $\rho > 0$, an ellipsoid in the Sobolev space $\mathcal{W}_p$.
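The trigonometric basis (3.2) and the associated Sobolev weights are easy to write down explicitly. The sketch below is only illustrative: the function names are hypothetical, and the inner product is approximated with a midpoint rule on an equispaced grid, a numerical shortcut chosen here for convenience.

```python
import math


def trig_basis(j, t):
    """Value at t of the j-th trigonometric basis function, j = 1, 2, ..."""
    if j == 1:
        return 1.0
    k = j // 2  # frequency shared by psi_{2k} and psi_{2k+1}
    if j % 2 == 0:
        return math.sqrt(2.0) * math.cos(2.0 * math.pi * k * t)
    return math.sqrt(2.0) * math.sin(2.0 * math.pi * k * t)


def sobolev_weights(m, p):
    """First m weights: 1 for j = 1, then k**(2p) for j = 2k and j = 2k + 1."""
    return [1.0 if j == 1 else float((j // 2) ** (2 * p)) for j in range(1, m + 1)]


def inner_product(f, g, grid_size=2000):
    """Midpoint-rule approximation of the L2[0, 1] inner product of f and g."""
    step = 1.0 / grid_size
    return step * sum(f((i + 0.5) * step) * g((i + 0.5) * step)
                      for i in range(grid_size))


weights = sobolev_weights(5, 1)  # weights for p = 1 and m = 5
unit_norm = inner_product(lambda t: trig_basis(2, t), lambda t: trig_basis(2, t))
orthogonal = inner_product(lambda t: trig_basis(2, t), lambda t: trig_basis(3, t))
```

With these weights, membership of a function in the ellipsoid can be checked by summing the squared basis coefficients multiplied by the weights and comparing the result with the radius.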

Mean squared prediction error. We shall first measure the performance of the estimator by considering the mean prediction error (MPE), i.e., $E\|\hat\beta - \beta\|_\Gamma^2$ with $\|f\|_\Gamma^2 := \langle\Gamma f, f\rangle$. In this case, if $\Gamma$ satisfies a link condition, that is $\Gamma \in \mathcal{N}_v^d$, $d \ge 1$, for some weight sequence $v$ (see definition (2.3)), then it follows by using the inequality of Heinz (1951) that the MPE is equivalent to the $\mathcal{W}_v$-risk, that is $E\|\hat\beta - \beta\|_v^2$. To illustrate the previous results we assume in the following the sequence $(v_j)_{j\in\mathbb{N}}$ to be either polynomially decreasing, i.e., $v_1 = 1$ and $v_j = |j|^{-2a}$, $j \ge 2$, for some $a > 1/2$, or exponentially decreasing, i.e., $v_1 = 1$ and $v_j = \exp(-|j|^{2a})$, $j \ge 2$, for some $a > 0$. In the polynomial case a simple calculation shows that a covariance operator $\Gamma \in \mathcal{N}_v^d$ acts like integrating $(2a)$-times and hence it is called finitely smoothing (c.f. Natterer (1984)). Furthermore, if the eigenfunctions of $\Gamma$ are $\{\psi_j\}$, then $\Gamma \in \mathcal{N}_v^d$ holds if and only if the eigenvalues $\lambda_j$ of $\Gamma$ satisfy $\lambda_j \asymp_d |j|^{-2a}$, which is the case considered, for example, in Crambes et al. (2008). On the other hand, in the exponential case it can easily be seen that the link condition $\Gamma \in \mathcal{N}_v^d$ implies $\mathcal{R}(\Gamma) \subset \mathcal{W}_p$ for all $p > 0$; therefore the operator $\Gamma$ is called infinitely smoothing (c.f. Mair (1994)). Moreover, if the eigenfunctions of $\Gamma$ are $\{\psi_j\}$, then $\Gamma \in \mathcal{N}_v^d$ holds if and only if the eigenvalues $\lambda_j$ of $\Gamma$ satisfy $\lambda_j \asymp_d \exp(-j^{2a})$. To the best of our knowledge this case has not yet been considered in the literature. Since in both cases the basic regularity Assumption 2.1 is satisfied, the lower bounds presented in the next assertion follow directly from Theorem 2.3. Here and subsequently, we write $a_n \lesssim b_n$ when there exists $C > 0$ such that $a_n \le C\,b_n$ for all sufficiently large $n \in \mathbb{N}$, and $a_n \sim b_n$ when $a_n \lesssim b_n$ and $b_n \lesssim a_n$ simultaneously.

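Proposition 3.1. Under the assumptions of Theorem 2.3 we have for any estimator $\tilde\beta$

(i) in the polynomial case, i.e. $v_1 = 1$ and $v_j = |j|^{-2a}$, $j \ge 2$, for some $a > 1/2$, that

$$\sup_{\beta\in\mathcal{W}_p^\rho}\bigl\{E\|\tilde\beta - \beta\|_v^2\bigr\} \gtrsim n^{-(2p+2a)/(2p+2a+1)},$$

(ii) in the exponential case, i.e. $v_1 = 1$ and $v_j = \exp(-|j|^{2a})$, $j \ge 2$, for some $a > 0$, that

$$\sup_{\beta\in\mathcal{W}_p^\rho}\bigl\{E\|\tilde\beta - \beta\|_v^2\bigr\} \gtrsim n^{-1}(\log n)^{1/(2a)}.$$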






On the other hand, if the dimension parameter $m$ and the threshold $\gamma$ in the definition of the estimator $\hat\beta$ given in (1.6) are chosen appropriately, then, by applying Theorem 2.4, the rates of the lower bound given in the last assertion also provide, up to a constant, the upper bound of the risk of the estimator $\hat\beta$, which is summarized in the next proposition.

Proposition 3.2. Under the assumptions of Theorem 2.3 consider the estimator $\hat\beta$

(i) in the polynomial case, i.e. $v_1 = 1$ and $v_j = |j|^{-2a}$, $j \ge 2$, for some $a > 1/2$, with $m \sim n^{1/(2p+2a+1)}$ and threshold $\gamma = n$. If in addition $k \ge 2 + 8/(2p + 2a - 1)$, then

$$\sup_{\beta\in\mathcal{W}_p^\rho}\bigl\{E\|\hat\beta - \beta\|_v^2\bigr\} \lesssim n^{-(2p+2a)/(2p+2a+1)},$$

(ii) in the exponential case, i.e. $v_1 = 1$ and $v_j = \exp(-|j|^{2a})$, $j \ge 2$, for some $a > 0$, with $m \sim (\log n)^{1/(2a)}$ and threshold $\gamma = n$. Then

$$\sup_{\beta\in\mathcal{W}_p^\rho}\bigl\{E\|\hat\beta - \beta\|_v^2\bigr\} \lesssim n^{-1}(\log n)^{1/(2a)}.$$

We have thus proved that these rates are optimal and the proposed estimator $\hat\beta$ is minimax optimal in both cases. It is worth noting that, replacing the condition $\gamma = n$ by $\gamma = cn$ with $c > 0$ appropriately chosen, Proposition 3.2 remains true when $p = 0$, that is to say when $\beta$ is just supposed to be square integrable.
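The orders of magnitude in Propositions 3.1 and 3.2 can be evaluated numerically. The sketch below simply codes the stated exponents; constants are ignored (the results only hold up to constants), and the function names are hypothetical.

```python
import math


def polynomial_case(n, p, a):
    """Order of the dimension m and of the rate when v_j = j**(-2a)."""
    denominator = 2.0 * p + 2.0 * a + 1.0
    dimension = n ** (1.0 / denominator)
    rate = n ** (-(2.0 * p + 2.0 * a) / denominator)
    return dimension, rate


def exponential_case(n, a):
    """Order of the dimension m and of the rate when v_j = exp(-j**(2a))."""
    dimension = math.log(n) ** (1.0 / (2.0 * a))
    rate = dimension / n  # n**(-1) * (log n)**(1/(2a))
    return dimension, rate


# Illustrative values of the sample size and of the smoothness parameters.
poly_dimension, poly_rate = polynomial_case(n=32, p=1, a=1)
exp_dimension, exp_rate = exponential_case(n=1000, a=1)
```

The exponential case does not take `p` as an argument, which reflects the fact that neither the order of the dimension parameter nor the order of the rate depends on the smoothness of the slope function in that case.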



Remark 3.1. It is of interest to compare our results with those of Crambes et al. (2008), who measure the performance of their estimator in terms of squared prediction error. In their notation the decay of the eigenvalues of $\Gamma$ is assumed to be of order $(|j|^{-2q-1})$, i.e., $q = a - 1/2$. Furthermore, they suppose the slope function to be $m$-times continuously differentiable, i.e., $m = p$. By using this parametrization we see that our results in the polynomial case imply the same rate of convergence in probability of the prediction error as presented in Crambes et al. (2008). However, from our general results follow a lower and an upper bound of the MPE not only in the polynomial case but also in the exponential case.

Furthermore, we emphasize the interesting influence of the parameters $p$ and $a$ characterizing the smoothness of $\beta$ and the smoothing properties of $\Gamma$, respectively. As we see from Propositions 3.1 and 3.2, in the polynomial case an increasing value of $p$ leads to a faster optimal rate. In other words, as expected, a smoother regression function can be estimated faster. The situation in the exponential case is very different. It seems rather surprising that, contrary to the polynomial case, in the exponential case the optimal rate of convergence does not depend on the value of $p$; however, this dependence is clearly hidden in the constant. Furthermore, the parameter $m$ does not even depend on the value of $p$. Thereby, the proposed estimator is automatically adaptive, i.e., it does not involve a priori knowledge of the degree of smoothness of the slope function $\beta$. However, the choice of the smoothing parameter depends on the value $a$ specifying the decay of $\{v_j\}$. Note further that in both cases an increasing value of $a$ leads to a faster optimal rate of convergence, i.e., we may call $1/a$ the degree of ill-posedness (c.f. Natterer (1984)). □



Estimation of the derivatives. Let us consider now the estimation of derivatives of the slope function $\beta$. It is well known that for any function $g$ belonging to a Sobolev ellipsoid $\mathcal{W}_p^\rho$ the Sobolev norm $\|g\|_{b^s}$ for each $0 \le s \le p$ is equivalent to the $L^2$-norm of the $s$-th weak derivative $g^{(s)}$, i.e., $\|g^{(s)}\|$. Thereby, the results in the previous section imply again a lower bound as well as an upper bound of the $L^2$-risk for the estimation of the $s$-th weak derivative of $\beta$. In the following we consider again the two particular cases of polynomial and exponential decreasing rates for the sequence of weights $(v_j)$. The next assertion summarizes lower bounds for the $L^2$-risk for the estimation of the $s$-th weak derivative of $\beta$ in both cases.

Proposition 3.3. Under the assumptions of Theorem 2.3 we have for any estimator $\tilde\beta^{(s)}$

(i) in the polynomial case, i.e. $v_1 = 1$ and $v_j = |j|^{-2a}$, $j \ge 2$, for some $a > 1/2$, that

\[ \sup_{\beta\in\mathcal{W}_p^\rho}\big\{\mathbb{E}\|\tilde\beta^{(s)} - \beta^{(s)}\|^2\big\} \gtrsim n^{-(2p-2s)/(2p+2a+1)}; \]

(ii) in the exponential case, i.e. $v_1 = 1$ and $v_j = \exp(-|j|^{2a})$, $j \ge 2$, for some $a > 0$, that

\[ \sup_{\beta\in\mathcal{W}_p^\rho}\big\{\mathbb{E}\|\tilde\beta^{(s)} - \beta^{(s)}\|^2\big\} \gtrsim (\log n)^{-(p-s)/a}. \]
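To get a feeling for the two regimes in Proposition 3.3, the following minimal sketch evaluates the order of the lower bounds for given $n$, $p$, $a$ and $s$. The function names are hypothetical and all constants are set to one, so only the order of magnitude is meaningful.

```python
import math


def polynomial_rate(n, p, a, s=0):
    """Order n^{-(2p-2s)/(2p+2a+1)} of case (i); constants are ignored."""
    if a <= 0.5 or not (0 <= s <= p):
        raise ValueError("case (i) assumes a > 1/2 and 0 <= s <= p")
    return n ** (-(2 * p - 2 * s) / (2 * p + 2 * a + 1))


def exponential_rate(n, p, a, s=0):
    """Order (log n)^{-(p-s)/a} of case (ii); constants are ignored."""
    if a <= 0 or not (0 <= s <= p):
        raise ValueError("case (ii) assumes a > 0 and 0 <= s <= p")
    return math.log(n) ** (-(p - s) / a)


if __name__ == "__main__":
    n = 10 ** 6
    # A higher derivative (larger s) can only be estimated at a slower rate.
    print(polynomial_rate(n, p=2, a=1, s=0), polynomial_rate(n, p=2, a=1, s=1))
    print(exponential_rate(n, p=2, a=1, s=0), exponential_rate(n, p=2, a=1, s=1))
```

In both cases the value returned for $s = 1$ is larger than the one for $s = 0$, which reflects that derivatives are harder to estimate than the slope function itself.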



On the other hand, considering the estimator $\hat\beta$ given in (1.6), we only have to calculate the $s$-th weak derivative of $\hat\beta$. Given the exponential basis, which is linked to the trigonometric basis by the relation $\exp(2i\pi kt) = 2^{-1/2}(\psi_{2k}(t) + i\,\psi_{2k+1}(t))$, for $k \in \mathbb{Z}$ and $t \in [0,1]$, with $i^2 = -1$, we recall that for $0 \le s < p$ the $s$-th derivative $\beta^{(s)}$ of $\beta$ in a weak sense satisfies

\[ \beta^{(s)}(t) = \sum_{k\in\mathbb{Z}} (2i\pi k)^s \Big(\int_0^1 \beta(u)\exp(-2i\pi ku)\,du\Big)\exp(2i\pi kt). \]
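The series representation of the weak derivative can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the function is observed on the equidistant grid $t_j = j/N$, the integrals are replaced by Riemann sums, and the series is truncated at $|k| \le K$. The name `weak_derivative` and its parameters are hypothetical.

```python
import numpy as np


def weak_derivative(beta_vals, s, K):
    """Approximate the s-th derivative of a 1-periodic function.

    beta_vals holds the values beta(j/N), j = 0, ..., N-1. The Fourier
    coefficients are approximated by Riemann sums and multiplied by
    (2*pi*i*k)^s, following the series representation of beta^{(s)}.
    """
    beta_vals = np.asarray(beta_vals, dtype=float)
    N = len(beta_vals)
    t = np.arange(N) / N
    ks = np.arange(-K, K + 1)
    # c_k ~ integral of beta(u) * exp(-2*pi*i*k*u) du
    coeffs = np.exp(-2j * np.pi * np.outer(ks, t)) @ beta_vals / N
    # multiplier (2*pi*i*k)^s; for s = 0 the function itself is recovered
    factors = (2j * np.pi * ks) ** s if s > 0 else np.ones(len(ks))
    # synthesis: sum over k of factor_k * c_k * exp(2*pi*i*k*t)
    values = (factors * coeffs) @ np.exp(2j * np.pi * np.outer(ks, t))
    return np.real(values)


if __name__ == "__main__":
    N = 64
    t = np.arange(N) / N
    d1 = weak_derivative(np.sin(2 * np.pi * t), s=1, K=3)
    print(np.max(np.abs(d1 - 2 * np.pi * np.cos(2 * np.pi * t))))
```

For a trigonometric polynomial of low degree the Riemann sums are exact up to rounding, so the printed error is at the level of machine precision.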

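Given a dimension $m \ge 1$, we denote now by $[\Gamma]_m$ the $(2m+1)\times(2m+1)$ matrix with generic elements $\langle\Gamma\psi_\ell,\psi_j\rangle$, $-m \le j,\ell \le m$, and by $[g]_m$ the $(2m+1)$-vector with elements $\langle g,\psi_\ell\rangle$, $-m \le \ell \le m$. Furthermore, for integer $s$ define the diagonal matrix $\nabla_m^s$ with entries $(\nabla_m^s)_{jj} := (2i\pi j)^s$, $-m \le j \le m$. Then we consider the estimator of $\beta^{(s)}$ defined by

\[
\hat\beta^{(s)} := \sum_{j=-m}^{m} [\hat\beta^{(s)}]_j\,\psi_j
\quad\text{with}\quad
[\hat\beta^{(s)}]_m :=
\begin{cases}
\nabla_m^s\,[\hat\Gamma]_m^{-1}[\hat g]_m, & \text{if } [\hat\Gamma]_m \text{ is nonsingular and } \|[\hat\Gamma]_m^{-1}\| \le \gamma,\\
0, & \text{otherwise.}
\end{cases}
\tag{3.3}
\]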
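The thresholding rule in (3.3) can be sketched in a few lines. The code below is a simplified illustration and not the authors' implementation: it assumes that the coefficients of the curves $X_i$ in a real orthonormal basis are already available as rows of an array, it uses the spectral norm for $\|[\hat\Gamma]_m^{-1}\|$, and the optional diagonal multiplier plays the role of the derivative matrix. The name `thresholded_projection` and its parameters are hypothetical.

```python
import numpy as np


def thresholded_projection(X_coef, Y, gamma, diag=None):
    """Thresholded projection estimate of the coefficient vector.

    X_coef : (n, m) array, row i holds the basis coefficients of X_i.
    Y      : (n,) array of responses.
    gamma  : threshold on the spectral norm of the inverse Gram matrix.
    diag   : optional length-m vector of diagonal multipliers.
    """
    X_coef = np.asarray(X_coef, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, m = X_coef.shape
    Gamma_hat = X_coef.T @ X_coef / n   # empirical covariance matrix
    g_hat = X_coef.T @ Y / n            # empirical cross-covariance vector
    # The spectral norm of the inverse is 1 / (smallest singular value).
    sigma_min = np.linalg.svd(Gamma_hat, compute_uv=False)[-1]
    if sigma_min <= 0 or 1.0 / sigma_min > gamma:
        return np.zeros(m)              # thresholding: estimator set to zero
    coef = np.linalg.solve(Gamma_hat, g_hat)
    return coef if diag is None else np.asarray(diag) * coef


if __name__ == "__main__":
    X = np.sqrt(2.0) * np.eye(2)
    Y = X @ np.array([1.0, -2.0])
    print(thresholded_projection(X, Y, gamma=10.0))  # close to [1, -2]
    print(thresholded_projection(X, Y, gamma=0.5))   # thresholded to zero
```

Checking the smallest singular value before solving avoids inverting a numerically singular matrix, which is exactly the situation the threshold is meant to exclude.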



Furthermore, if the dimension parameter $m$ and the threshold $\gamma$ in the definition of the estimator $\hat\beta^{(s)}$ given in (3.3) are chosen appropriately, then by applying Theorem 2.4 the rates of the lower bound given in the last assertion provide, up to a constant, also the upper bound of the $L^2$-risk of the estimator $\hat\beta^{(s)}$, which is summarized in the next proposition. We have thus proved that these rates are optimal and that the proposed estimator $\hat\beta^{(s)}$ is minimax optimal in both cases.

Proposition 3.4. Under the assumptions of Theorem 2.3 consider the estimator $\hat\beta^{(s)}$

(i) in the polynomial case, i.e. $v_1 = 1$ and $v_j = |j|^{-2a}$, $j \ge 2$, for some $a > 1/2$, with $m \sim n^{1/(2p+2a+1)}$ and threshold $\gamma = n$. If in addition $k \ge 2 + 8/(2p+2a-1)$, then

\[ \sup_{\beta\in\mathcal{W}_p^\rho}\big\{\mathbb{E}\|\hat\beta^{(s)} - \beta^{(s)}\|^2\big\} \lesssim n^{-(2p-2s)/(2p+2a+1)}; \]

(ii) in the exponential case, i.e. $v_1 = 1$ and $v_j = \exp(-|j|^{2a})$, $j \ge 2$, for some $a > 0$, with $m \sim (\log n)^{1/(2a)}$ and threshold $\gamma = n$. Then

\[ \sup_{\beta\in\mathcal{W}_p^\rho}\big\{\mathbb{E}\|\hat\beta^{(s)} - \beta^{(s)}\|^2\big\} \lesssim (\log n)^{-(p-s)/a}. \]

Remark 3.2. It is worth noting that the $L^2$-risk in estimating the slope function $\beta$ itself, i.e., $s = 0$, has been considered in Hall and Horowitz (2007) only in the polynomial case. In their notation the decrease of the eigenvalues of $\Gamma$ is of order $j^{-\alpha}$, i.e., $\alpha = 2a$. Furthermore, the Fourier coefficients of the slope function decay at least with rate $j^{-\beta}$, i.e., $\beta = p + 1/2$. By using this new parametrization we see that we recover the result of Hall and Horowitz (2007) in the polynomial case with $s = 0$, but without the additional assumption $\beta > \alpha/2 + 1$ or $\beta > \alpha - 1/2$.
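The reparametrization can be checked by a one-line substitution. The display below assumes that the rate of Hall and Horowitz (2007) reads $n^{-(2\beta-1)/(\alpha+2\beta)}$ in their notation, where $\alpha$ and $\beta$ are their decay parameters and not the slope function; this form is recalled here for illustration and is not restated in the present text.

```latex
\[
  n^{-(2\beta-1)/(\alpha+2\beta)}\Big|_{\alpha = 2a,\ \beta = p+1/2}
  \;=\; n^{-\,2p/(2a+2p+1)},
\]
% which coincides with the rate n^{-(2p-2s)/(2p+2a+1)} of Proposition 3.4 (i) for s = 0.
```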

Furthermore, we discuss again the influence of the parameters $p$ and $a$. As we see from Propositions 3.3 and 3.4, in both cases a decreasing value of $a$ or an increasing value of $p$ leads to a faster optimal rate of convergence. Hence, in contrast to the MPE, when considering the $L^2$-risk the parameter $a$ describes in both cases the degree of ill-posedness. Furthermore, the estimation of higher derivatives of the slope function, i.e. considering a larger value of $s$, is as usual only possible with a slower optimal rate. Finally, as for the MPE, in the exponential case the parameter $m$ does not depend on the values of $p$ or $s$, hence the proposed estimator is automatically adaptive. □
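The adaptivity statement can be made concrete with a small sketch of the tuning rule of Proposition 3.4. It is an illustration only: the proportionality constant in $m$ is set to one, which the proposition does not prescribe, and the function name `tuning_parameters` is hypothetical.

```python
import math


def tuning_parameters(n, a, p=None, case="polynomial"):
    """Return (m, gamma) as suggested by the orders in Proposition 3.4.

    The constant in m ~ n^{1/(2p+2a+1)} or m ~ (log n)^{1/(2a)} is taken
    to be one (an assumption); the threshold is gamma = n in both cases.
    """
    if case == "polynomial":
        if p is None:
            raise ValueError("the polynomial case needs the smoothness p")
        m = n ** (1.0 / (2 * p + 2 * a + 1))
    elif case == "exponential":
        m = math.log(n) ** (1.0 / (2 * a))  # does not involve p or s
    else:
        raise ValueError("case must be 'polynomial' or 'exponential'")
    return max(1, round(m)), float(n)


if __name__ == "__main__":
    print(tuning_parameters(128, a=2, p=1))
    print(tuning_parameters(1000, a=0.5, case="exponential"))
```

In the exponential branch the argument `p` is never used, which mirrors the remark that the choice of $m$ requires no knowledge of the smoothness of the slope function.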

Remark 3.3. There is an interesting hidden issue in the parametrization we have chosen. Consider a classical indirect regression model with known operator given by $\Gamma$, i.e., $Y = [\Gamma\beta](U) + \varepsilon$ where $U$ has a uniform distribution on $[0,1]$ and $\varepsilon$ is white noise (for details see e.g. Mair and Ruymgaart (1996)). If in addition the operator $\Gamma$ is finitely smoothing, i.e., $(v_j)$ is polynomially decreasing with $v_j = j^{-2a}$, $j \ge 2$, then given an $n$-sample of $Y$ the optimal rate of convergence of the $\mathcal{W}_s$-risk of any estimator of $\beta$ is of order $n^{-2(p-s)/[2(p+2a)+1]}$, since $\mathcal{R}(\Gamma) = \mathcal{W}_{2a}$ (cf. Mair and Ruymgaart (1996) or Chen and Reiß (2008)). However, we have shown that in a functional linear model, even with estimated operator, the optimal rate is of order $n^{-2(p-s)/[2(p+a)+1]}$. Thus, comparing both rates, we see that in a functional linear model the covariance operator $\Gamma$ has the degree of ill-posedness $a$, while the same operator has, in the indirect regression model, a degree of ill-posedness $2a$. In other words, in a functional linear model we do not face the complexity of an inversion of $\Gamma$ but only of its square root $\Gamma^{1/2}$. This, roughly speaking, may be seen as a multiplication of the normal equation $YX = \langle\beta,X\rangle X + X\varepsilon$ by the inverse of $\Gamma^{1/2}$. Remarking that $\Gamma$ is also the covariance operator associated with the error term $\varepsilon X$, the multiplication by the inverse of $\Gamma^{1/2}$ leads, roughly speaking, to white noise. □



4 Concluding remarks and perspectives

We have proposed in this work a new kind of estimation procedure for the regression function and its derivatives in the functional linear model and proved that it can attain optimal rates of convergence.

These estimators depend on two parameters which play the role of smoothing parameters, the dimension $m$ of the projection space and the threshold value $\gamma$. Building data-driven rules that select the values of these parameters automatically is certainly a topic that deserves further attention, and one promising direction is to adapt the selection techniques proposed in Efromovich and Koltchinskii (2001), Goldenshluger and Pereverzev (2000) and Tsybakov (2000).

Another point of interest is to extend the thresholding approach in order to consider different thresholding rules for different coordinates in the considered basis. This could lead, for instance with a wavelet basis, to estimators that adapt to sparseness as well as to varying regularity of the regression function.

A Appendix: Proofs

A.1 Proofs of Section 2

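We begin by defining and recalling notations to be used in the proofs of this section. Given $m > 0$, a Galerkin solution of $g = \Gamma\beta$ is denoted by $\beta^m \in \Psi_m$ (see equation (1.7)). Furthermore, we use the notations

\[
\hat\beta_m := \sum_{j=1}^{m} [\hat\beta_m]_j\,\psi_j
\quad\text{with}\quad
[\hat\beta_m]_m := [\hat\Gamma]_m^{-1}[\hat g]_m\,\mathbb{1}\{\|[\hat\Gamma]_m^{-1}\| \le \gamma\},
\]
\[
[\hat\Gamma]_m = \frac{1}{n}\sum_{i=1}^{n}[X_i]_m[X_i]_m^t,\qquad
[\tilde X_i]_m := [\Gamma]_m^{-1/2}[X_i]_m,\qquad
[\tilde\Gamma]_m := \frac{1}{n}\sum_{i=1}^{n}[\tilde X_i]_m[\tilde X_i]_m^t,
\]
\[
[\Xi_n]_m := [\tilde\Gamma]_m - [I]_m,\qquad
[T_n]_m := \frac{1}{n}\sum_{i=1}^{n}\langle X_i,\beta-\beta^m\rangle[X_i]_m,\qquad
[W_n]_m := \frac{\sigma}{n}\sum_{i=1}^{n}\varepsilon_i[X_i]_m,
\tag{A.1}
\]

where $[\hat g]_m - [\hat\Gamma]_m[\beta^m]_m = [T_n]_m + [W_n]_m$ with $\mathbb{E}[T_n]_m = [\Gamma(\beta-\beta^m)]_m = 0$ and $\mathbb{E}[W_n]_m = 0$. Moreover, $\mathbb{E}[\hat\Gamma]_m = [\Gamma]_m$ and $[\tilde\Gamma]_m = [\Gamma]_m^{-1/2}[\hat\Gamma]_m[\Gamma]_m^{-1/2}$, and hence $\mathbb{E}[\Xi_n]_m = 0$. Moreover, let us introduce the events

\[
\Omega := \{\|[\hat\Gamma]_m^{-1}\| \le \gamma\},\quad
\Omega_{1/2} := \{\|[\Xi_n]_m\| \le 1/2\},\quad
\Omega^c := \{\|[\hat\Gamma]_m^{-1}\| > \gamma\}\quad\text{and}\quad
\Omega_{1/2}^c := \{\|[\Xi_n]_m\| > 1/2\}.
\tag{A.2}
\]

Observe that $\Omega_{1/2} \subset \Omega$ in case $\gamma \ge 2\|[\Gamma]_m^{-1}\|$. Indeed, if $\|[\Xi_n]_m\| \le 1/2$ then the identity $[\hat\Gamma]_m = [\Gamma]_m^{1/2}\{I + [\Xi_n]_m\}[\Gamma]_m^{1/2}$ implies by the usual Neumann series argument that $\|[\hat\Gamma]_m^{-1}\| \le 2\|[\Gamma]_m^{-1}\|$. Thereby, if $\gamma \ge 2\|[\Gamma]_m^{-1}\|$, then we have $\Omega_{1/2} \subset \Omega$. These results will be used below without further reference.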
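For convenience, the Neumann series step can be spelled out as follows. This is a routine verification, written under the assumption that the identity in question reads $[\hat\Gamma]_m = [\Gamma]_m^{1/2}\{I + [\Xi_n]_m\}[\Gamma]_m^{1/2}$ with $[\Gamma]_m$ symmetric and strictly positive definite; it is added as an illustration and is not part of the original argument.

```latex
\begin{align*}
\big\|\{I + [\Xi_n]_m\}^{-1}\big\|
  &= \Big\|\sum_{k \ge 0} \big(-[\Xi_n]_m\big)^k\Big\|
   \le \sum_{k \ge 0} \big\|[\Xi_n]_m\big\|^k
   \le \sum_{k \ge 0} 2^{-k} = 2
   \qquad \text{if } \big\|[\Xi_n]_m\big\| \le 1/2, \\
\big\|[\hat\Gamma]_m^{-1}\big\|
  &= \big\|[\Gamma]_m^{-1/2}\{I + [\Xi_n]_m\}^{-1}[\Gamma]_m^{-1/2}\big\|
   \le 2\,\big\|[\Gamma]_m^{-1/2}\big\|^2
   = 2\,\big\|[\Gamma]_m^{-1}\big\| .
\end{align*}
```

The last equality uses that the spectral norm of the inverse of a symmetric positive definite matrix equals the square of the norm of its inverse square root.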

At the end of this section we prove the technical Lemmas A.1 and A.2, which are used in the following proofs.

Proof of the consistency. 

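Proof of Proposition 2.1. The proof is based on the decomposition

\[
\mathbb{E}\|\hat\beta_m - \beta\|_\omega^2 \le 2\big\{\mathbb{E}\|\hat\beta_m - \beta^m\|_\omega^2 + \|\beta^m - \beta\|_\omega^2\big\}.
\tag{A.3}
\]

Since $\gamma \ge 2\|[\Gamma]_m^{-1}\|$ it follows that $\Omega^c \subset \Omega_{1/2}^c$ and hence

\[
\mathbb{E}\|\hat\beta_m - \beta\|_\omega^2 \le 2\big\{\|\beta^m - \beta\|_\omega^2 + \mathbb{E}\|\hat\beta_m - \beta^m\|_\omega^2\mathbb{1}_\Omega + \|\beta^m\|_\omega^2\,P(\Omega_{1/2}^c)\big\}.
\tag{A.4}
\]

On the other hand we show below for some constant $C > 0$ the following bound

\[
\mathbb{E}\|\hat\beta_m - \beta^m\|_\omega^2\mathbb{1}_\Omega
\le C\,\big\|[\mathrm{Diag}(\omega)]_m^{1/2}[\Gamma]_m^{-1/2}\big\|^2\,\frac{m}{n}\,\eta\,\big\{\sigma^2 + \|\beta-\beta^m\|^2\,\mathbb{E}\|X\|^2\big\}
\Big\{1 + \gamma^2\,\frac{m^2}{n}\,\eta^{-1/2}\big(P(\Omega_{1/2}^c)\big)^{1/2}\big\|[\Gamma]_m\big\|^2\Big\},
\tag{A.5}
\]

where Markov's inequality together with (A.12) in Lemma A.1 implies $P(\Omega_{1/2}^c) \le C\eta\,m^2/n$ for some $C > 0$. Moreover, $\|[\Gamma]_m\|^2 \le \|\Gamma\|^2$ and $\|[\mathrm{Diag}(\omega)]_m^{1/2}[\Gamma]_m^{-1/2}\|^2 \le \gamma\sup_{1\le j\le m}\{\omega_j\}$ since $\gamma \ge 2\|[\Gamma]_m^{-1}\|$, which by combination of (A.4) and (A.5) leads to the estimate

\[
\mathbb{E}\|\hat\beta_m - \beta\|_\omega^2 \le C\Big\{\|\beta^m - \beta\|_\omega^2 + \|\beta^m\|_\omega^2\,\frac{m^2}{n}\,\eta
+ \gamma\sup_{1\le j\le m}\{\omega_j\}\,\frac{m}{n}\,\eta\,\big\{\sigma^2 + \|\beta-\beta^m\|^2\,\mathbb{E}\|X\|^2\big\}\Big\{1 + \gamma^2\,\frac{m^3}{n^{1+1/2}}\,\|\Gamma\|^2\Big\}\Big\}
\tag{A.6}
\]

for some $C > 0$. Furthermore, for each $\beta \in \mathcal{W}_\omega$, we have $\|\beta - \beta^m\|_\omega = o(1)$ as $m \to \infty$, which can be seen as follows. Since $\|\Pi_m^\perp\beta\| = o(1)$ and $\|\Pi_m^\perp\beta\|_\omega = o(1)$ as $m \to \infty$ by Lebesgue's dominated convergence theorem, the assertion follows from the identity $[\Pi_m\beta - \beta^m]_m = -[\Gamma]_m^{-1}[\Gamma\Pi_m^\perp\beta]_m$ by using that $\|\Pi_m\beta - \beta^m\|_\omega \le \|\Pi_m^\perp\beta\|_\omega\,\sup_m\|\Gamma_m^{-1}\Pi_m\Gamma\Pi_m^\perp\|_\omega = O(\|\Pi_m^\perp\beta\|_\omega)$. Consequently, the conditions on $m$ and $\gamma$ ensure the convergence to zero as $n \to \infty$ of the bound given in (A.6), which proves the result.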

Proof of (A.5). From the identity $[\hat g]_m - [\hat\Gamma]_m[\beta^m]_m = [T_n]_m + [W_n]_m$ it follows that

E||i9 - ^ m || 2 = E||[DiagM]V 2 {[r]" 1 + fl-^trim - PUPT^ > ^ T ^+ Wmlfln- 

Since $2\|[\Gamma]_m^{-1}\| \le \gamma$ we have $\Omega_{1/2} \subset \Omega$, and hence by using $\|[\hat\Gamma]_m^{-1}\|^2\mathbb{1}_\Omega \le \gamma^2$ we obtain

m-nt < 3||[Diag(u,)]V 2 [ T ]^f {E\\[T]^{[T n ] m + [W^jf 

+ 7 2 l|[ryi 2 (E||[s n y| 8 )V^ 

+ E|| {j + [H,,]^}- 1 1| 2 1| [H^un 2 1| [r]^ 1 / 2 + [W^]^}!! 2 !^,/, } . 

From (A.10) - (A.12) in Lemma A.1 together with $\|\{I + [\Xi_n]_m\}^{-1}\|\,\|[\Xi_n]_m\|\,\mathbb{1}_{\Omega_{1/2}} \le 1$ then follows (A.5), which completes the proof. □

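Proof of Corollary 2.2. The link condition $\Gamma \in \mathcal{N}_v^d$ implies $2\|[\Gamma]_m^{-1}\| \le 8d^3/v_m = \gamma$, $\|[\mathrm{Diag}(\omega)]_m^{1/2}[\Gamma]_m^{-1/2}\|^2 \le 4d^3\sup_{1\le j\le m}\{\omega_j/v_j\}$ and $\|[\Gamma]_m\|^2 \le d^2$ by using the estimates (A.16), (A.17) and (A.18) in Lemma A.3, respectively. Therefore, by combination of (A.4) and (A.5) in the proof of Proposition 2.1 we obtain

\[
\mathbb{E}\|\hat\beta_m - \beta\|_\omega^2 \le C\Big\{\|\beta^m - \beta\|_\omega^2 + \|\beta^m\|_\omega^2\,\frac{m^2}{n}\,\eta
+ d^3\sup_{1\le j\le m}\{\omega_j/v_j\}\,\frac{m}{n}\,\eta\,\big\{\sigma^2 + \|\beta-\beta^m\|^2\,\mathbb{E}\|X\|^2\big\}\Big\{1 + \frac{m^3}{n^{1+1/2}\,v_m^2}\,d^8\Big\}\Big\}
\tag{A.7}
\]

for some $C > 0$. By using the identity $[\Pi_m\beta - \beta^m]_m = -[\Gamma]_m^{-1}[\Gamma\Pi_m^\perp\beta]_m$ and the estimate (A.23) in the proof of Lemma A.3 with $b = \omega$, the link condition $\Gamma \in \mathcal{N}_v^d$ implies further that $\|\Gamma_m^{-1}\Pi_m\Gamma\Pi_m^\perp\|_\omega^2 = \sup_{\|\Pi_m^\perp\beta\|_\omega \le 1}\|\Pi_m\beta - \beta^m\|_\omega^2 \le 2(1 + d^4)$ for all $m \in \mathbb{N}$. Therefore we have $\|\beta - \beta^m\|_\omega = o(1)$ as $m \to \infty$ for each $\beta \in \mathcal{W}_\omega$. Consequently, the conditions on $m$ and $\gamma$ ensure the convergence to zero as $n \to \infty$ of the bound given in (A.7), which proves the result. □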

Proof of the lower bound. 

Proof of Theorem 2.3. Let $X_i$, $i \in \mathbb{N}$, be i.i.d. copies of $X$ with associated covariance operator $\Gamma$ belonging to $\mathcal{N}_v^d$. Then for each $j$, $[X_i]_j$ is centered and has variance $\mathbb{E}[X_i]_j^2 = \langle\Gamma\psi_j,\psi_j\rangle \le v_j d$. This result will be used below without further reference. Consider independent error terms $\varepsilon_i \sim \mathcal{N}(0,1)$, $i \in \mathbb{N}$, which are independent of the random functions $\{X_i\}$. Let $\theta \in \{-1,1\}^{m_*}$, where $m_* := m_*(n) \in \mathbb{N}$ satisfies (2.4) for some $\Delta \ge 1$. Define an $m_*$-vector $u$ of coefficients $u_j$ satisfying (A.14) in Lemma A.2. For each $\theta$ we consider a slope function $\beta_\theta := \sum_{j=1}^{m_*}\theta_j u_j\psi_j$, which belongs to $\mathcal{W}_b^\rho$ by using (A.15) in Lemma A.2. Consequently, for each $\theta$ the random variables $(Y_i, X_i)$ with $Y_i := \int_0^1\beta_\theta(s)X_i(s)\,ds + \sigma\varepsilon_i$, $i = 1,\dots,n$, form a sample of the model (1.1) and we denote its joint distribution by $P_\theta$. Furthermore, for $j = 1,\dots,m_*$ and each $\theta$ we introduce $\theta^{(j)}$ by $\theta^{(j)}_l = \theta_l$ for $j \ne l$ and $\theta^{(j)}_j = -\theta_j$. As in case of $P_\theta$ the conditional distribution of $Y_i$ given $X_i$ is Gaussian with mean $\sum_{j=1}^{m_*}\theta_j u_j[X_i]_j$ and variance $\sigma^2$, it is easily seen that the log-likelihood of $P_{\theta^{(j)}}$ w.r.t. $P_\theta$ is given by

\[
\log\Big(\frac{dP_{\theta^{(j)}}}{dP_\theta}\Big)
= -\frac{2\theta_j u_j}{\sigma^2}\sum_{i=1}^{n}\Big(Y_i - \sum_{l=1}^{m_*}\theta_l u_l[X_i]_l\Big)[X_i]_j - \frac{2u_j^2}{\sigma^2}\sum_{i=1}^{n}[X_i]_j^2
\]

and its expectation w.r.t. $P_\theta$ satisfies $\mathbb{E}_{P_\theta}[\log(dP_{\theta^{(j)}}/dP_\theta)] \ge -2nd\,u_j^2v_j/\sigma^2$. In terms of Kullback-Leibler divergence this means $KL(P_{\theta^{(j)}}, P_\theta) \le 2nd\,u_j^2v_j/\sigma^2$. Since the Hellinger distance $H(P_{\theta^{(j)}}, P_\theta)$ satisfies $H^2(P_{\theta^{(j)}}, P_\theta) \le KL(P_{\theta^{(j)}}, P_\theta)$ it follows from (A.15) in Lemma A.2 that

\[
H^2(P_{\theta^{(j)}}, P_\theta) \le \frac{2nd}{\sigma^2}\,u_j^2\,v_j \le 1,\qquad j = 1,\dots,m_*.
\tag{A.8}
\]
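The log-likelihood computation behind (A.8) can be written out in a few steps. The derivation below is a sketch under the stated Gaussian assumption; the abbreviations $\mu_i$ and $\mu_i^{(j)}$ for the conditional means are introduced here for illustration only.

```latex
\begin{align*}
\mu_i &:= \sum_{l=1}^{m_*}\theta_l u_l [X_i]_l, \qquad
\mu_i^{(j)} := \mu_i - 2\theta_j u_j [X_i]_j, \\
\log\frac{dP_{\theta^{(j)}}}{dP_\theta}
 &= \frac{1}{2\sigma^2}\sum_{i=1}^{n}\Big\{(Y_i-\mu_i)^2 - \big(Y_i-\mu_i^{(j)}\big)^2\Big\}
  = -\frac{2\theta_j u_j}{\sigma^2}\sum_{i=1}^{n}(Y_i-\mu_i)[X_i]_j
    - \frac{2u_j^2}{\sigma^2}\sum_{i=1}^{n}[X_i]_j^2, \\
\mathbb{E}_{P_\theta}\Big[\log\frac{dP_{\theta^{(j)}}}{dP_\theta}\Big]
 &= -\frac{2n\,u_j^2}{\sigma^2}\,\mathbb{E}[X]_j^2
 \;\ge\; -\frac{2nd\,u_j^2 v_j}{\sigma^2}.
\end{align*}
```

The first sum has expectation zero under $P_\theta$ because $Y_i - \mu_i = \sigma\varepsilon_i$ is centered and independent of $X_i$, and the final inequality uses $\mathbb{E}[X]_j^2 \le v_j d$.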



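Consider the Hellinger affinity $\rho(P_{\theta^{(j)}}, P_\theta) = \int\sqrt{dP_{\theta^{(j)}}\,dP_\theta}$, then we obtain for any estimator $\tilde\beta$ of $\beta$ that

\begin{align*}
\rho(P_{\theta^{(j)}}, P_\theta)
&\le \int\frac{|\langle\tilde\beta - \beta_{\theta^{(j)}},\psi_j\rangle|}{|\langle\beta_\theta - \beta_{\theta^{(j)}},\psi_j\rangle|}\sqrt{dP_{\theta^{(j)}}\,dP_\theta}
 + \int\frac{|\langle\tilde\beta - \beta_\theta,\psi_j\rangle|}{|\langle\beta_\theta - \beta_{\theta^{(j)}},\psi_j\rangle|}\sqrt{dP_{\theta^{(j)}}\,dP_\theta} \\
&\le \Big(\int\frac{|\langle\tilde\beta - \beta_{\theta^{(j)}},\psi_j\rangle|^2}{|\langle\beta_\theta - \beta_{\theta^{(j)}},\psi_j\rangle|^2}\,dP_{\theta^{(j)}}\Big)^{1/2}
 + \Big(\int\frac{|\langle\tilde\beta - \beta_\theta,\psi_j\rangle|^2}{|\langle\beta_\theta - \beta_{\theta^{(j)}},\psi_j\rangle|^2}\,dP_\theta\Big)^{1/2}.
\tag{A.9}
\end{align*}

Due to the identity $\rho(P_{\theta^{(j)}}, P_\theta) = 1 - \frac{1}{2}H^2(P_{\theta^{(j)}}, P_\theta)$, combining (A.8) with (A.9) yields

\[
\big\{\mathbb{E}_{\theta^{(j)}}|\langle\tilde\beta - \beta_{\theta^{(j)}},\psi_j\rangle|^2 + \mathbb{E}_\theta|\langle\tilde\beta - \beta_\theta,\psi_j\rangle|^2\big\} \ge \frac{1}{2}\,u_j^2,\qquad j = 1,\dots,m_*.
\]

From this we conclude for each estimator $\tilde\beta$ that

\begin{align*}
\sup_{\beta\in\mathcal{W}_b^\rho}\mathbb{E}\|\tilde\beta - \beta\|_\omega^2
&\ge \sup_{\theta\in\{-1,1\}^{m_*}}\mathbb{E}_\theta\|\tilde\beta - \beta_\theta\|_\omega^2 \\
&\ge \frac{1}{2^{m_*}}\sum_{\theta\in\{-1,1\}^{m_*}}\sum_{j=1}^{m_*}\omega_j\,\mathbb{E}_\theta|\langle\tilde\beta - \beta_\theta,\psi_j\rangle|^2 \\
&= \frac{1}{2^{m_*}}\sum_{\theta\in\{-1,1\}^{m_*}}\sum_{j=1}^{m_*}\frac{\omega_j}{2}\big\{\mathbb{E}_\theta|\langle\tilde\beta - \beta_\theta,\psi_j\rangle|^2 + \mathbb{E}_{\theta^{(j)}}|\langle\tilde\beta - \beta_{\theta^{(j)}},\psi_j\rangle|^2\big\} \\
&\ge \frac{1}{4}\sum_{j=1}^{m_*}\omega_j u_j^2 \ge \frac{\delta_*}{4\Delta}\,\min\Big\{\frac{\sigma^2}{2d},\frac{\rho}{\Delta}\Big\},
\end{align*}

where the last inequality follows from (A.15) in Lemma A.2, which completes the proof. □

Proof of the upper bound.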

Proof of Theorem 2.4. Our proof starts with the observation that the link condition $\Gamma \in \mathcal{N}_v^d$ implies $2\|[\Gamma]_m^{-1}\| \le 8d^3/v_m$, $\|[\mathrm{Diag}(\omega)]_m^{1/2}[\Gamma]_m^{-1/2}\|^2 \le 4d^3\sup_{1\le j\le m}\{\omega_j/v_j\}$ and $\|[\Gamma]_m\|^2 \le d^2$ by using the estimates (A.16), (A.17) and (A.18) in Lemma A.3, respectively. Moreover, for all $X \in \mathcal{X}_\eta^{4k}$, by applying Markov's inequality, (A.12) in Lemma A.1 yields $P(\Omega_{1/2}^c) \le C\eta\,m^{2k}/n^k$ for some $C > 0$. Furthermore, by using the definition of $m_*$ the condition $m = m_*$ implies $1/v_{m_*} \le n\Delta/b_{m_*}$, and hence $\gamma = n\max(1, 8d^3\Delta/b_{m_*}) \ge 2\|[\Gamma]_{m_*}^{-1}\|$. Therefore, from (A.4) and (A.5) in the proof of Proposition 2.1 follows

\[
\mathbb{E}\|\hat\beta_m - \beta\|_\omega^2 \le C\Big\{\|\beta^{m_*} - \beta\|_\omega^2 + \|\beta^{m_*}\|_\omega^2\,\frac{m_*^{2k}}{n^k}\,\eta
+ d^3\sup_{1\le j\le m_*}\{\omega_j/v_j\}\,\frac{m_*}{n}\,\eta\,\big\{\sigma^2 + \|\beta-\beta^{m_*}\|^2\,\mathbb{E}\|X\|^2\big\}\Big\{1 + \frac{m_*^{2+k}}{n^{k/2-1}}\,d^8\Delta^2\Big\}\Big\}
\]



for some C > 0. Consequently, the definition of 5* by using (|A.19|) in Lemma \A.3\ i.e., 
11/3 - P m *\\l < 10(i 4 p<5;, and E||X|| 2 ^ dA, implies 

e||£-/3|| 2 <c^nd 18 AV + M} 

{l + m 2fc /(,5> fc )+m*/(5» sup {u^} ) {l + m 2 + k /(n^ 1 )} 

Thereby, the result follows from the condition (2.6), which ensures that the factors in braces are bounded as $n \to \infty$, which completes the proof. □

Technical assertions. 

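The following two lemmas gather technical results used in the proofs of Proposition 2.1, Theorem 2.3 and Theorem 2.4.

Lemma A.1. Suppose $X \in \mathcal{X}_\eta^{4k}$ and $\varepsilon \in \mathcal{E}_\eta^{4k}$, $k \in \mathbb{N}$. Then for some constant $C > 0$ only depending on $k$ we have

\begin{align*}
\mathbb{E}\big\|[\Gamma]_m^{-1/2}[W_n]_m\big\|^{2k} &\le C\,\frac{m^k}{n^k}\,\sigma^{2k}\,\eta, \tag{A.10}\\
\mathbb{E}\big\|[\Gamma]_m^{-1/2}[T_n]_m\big\|^{2k} &\le C\,\frac{m^k}{n^k}\,\|\beta - \beta^m\|^{2k}\,(\mathbb{E}\|X\|^2)^k\,\eta, \tag{A.11}\\
\mathbb{E}\big\|[\Xi_n]_m\big\|^{2k} &\le C\,\eta\,\frac{m^{2k}}{n^k}, \tag{A.12}\\
\mathbb{E}\big\|\{[\hat\Gamma]_m - [\Gamma]_m\}[\Gamma]_m^{-1/2}\big\|^{2k} &\le C\,\eta\,\frac{m^{2k}}{n^k}\,(\mathbb{E}\|X\|^2)^k. \tag{A.13}
\end{align*}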

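Proof. Let $\tilde W := [\Gamma]_m^{-1/2}[W_n]_m$, then $\mathbb{E}\|[\Gamma]_m^{-1/2}[W_n]_m\|^{2k} \le m^{k-1}\sum_{j=1}^{m}\mathbb{E}\tilde W_j^{2k}$, where $\tilde W_j = (\sigma/n)\sum_{i=1}^{n}\varepsilon_i[\tilde X_i]_j$. The random variables $(\varepsilon_i[\tilde X_i]_j)$, $1 \le i \le n$, are independent and identically distributed (i.i.d.) with mean zero. Since $X \in \mathcal{X}_\eta^{4k}$ and $\varepsilon \in \mathcal{E}_\eta^{4k}$, (A.10) follows from Theorem 2.10 in Petrov (1995), that is, $\mathbb{E}\tilde W_j^{2k} \le C n^{-k}\sigma^{2k}\,\mathbb{E}|\varepsilon[\tilde X]_j|^{2k} \le C n^{-k}\sigma^{2k}\eta$ for some constant $C > 0$ only depending on $k$.

Proof of (A.11). Due to $\mathbb{E}\langle\beta - \beta^m, X\rangle[X]_m = [\Gamma(\beta - \beta^m)]_m = 0$, the random variables $(\langle\beta - \beta^m, X_i\rangle[\tilde X_i]_m)$, $1 \le i \le n$, are i.i.d. with mean zero. Furthermore, we claim that $X \in \mathcal{X}_\eta^{4k}$ implies $\mathbb{E}|\langle\beta - \beta^m, X\rangle[\tilde X]_j|^{2k} \le C\,\eta\,\|\beta - \beta^m\|^{2k}(\mathbb{E}\|X\|^2)^k$ for each $j \in \mathbb{N}$. Then the estimate (A.11) follows in analogy to (A.10). Indeed, we have $\mathbb{E}|[\tilde X]_j|^{4k} \le \eta$ and

\[
\mathbb{E}|\langle\beta - \beta^m, X\rangle|^{4k} \le \dots \le \|\beta - \beta^m\|^{4k}\,(\mathbb{E}\|X\|^2)^{2k}\,\eta,
\]

which together imply the assertion by using the Cauchy-Schwarz inequality.

Proof of (A.12). From the identity $([\Xi_n]_m)_{j,l} = (1/n)\sum_{i=1}^{n}\{[\tilde X_i]_j[\tilde X_i]_l - \delta_{jl}\}$ with $\delta_{jl} = 1$ if $j = l$ and zero otherwise, we conclude $\mathbb{E}([\Xi_n]_m)_{j,l}^{2k} \le C n^{-k}\,\mathbb{E}|[\tilde X]_j[\tilde X]_l - \delta_{jl}|^{2k}$. Thus $X \in \mathcal{X}_\eta^{4k}$ implies $\mathbb{E}\|[\Xi_n]_m\|^{2k} \le m^{2(k-1)}\sum_{j,l}\mathbb{E}([\Xi_n]_m)_{j,l}^{2k} \le C\,m^{2k}n^{-k}\eta$.

The estimate (A.13) follows from (A.12) by using the identity $\{[\hat\Gamma]_m - [\Gamma]_m\}[\Gamma]_m^{-1/2} = [\Gamma]_m^{1/2}[\Xi_n]_m$, which completes the proof. □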



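Lemma A.2. Let $m_* \in \mathbb{N}$ and $\delta_*$ be chosen such that (2.4) is satisfied for some $\Delta \ge 1$. Consider an (infinite) vector $u$ with components $u_j$ satisfying

\[
u_j^2 = \frac{\zeta}{n\,v_j},\quad j \in \mathbb{N},\qquad\text{with } \zeta := \min\{\sigma^2/(2d),\ \rho/\Delta\},
\tag{A.14}
\]

then under Assumption 2.1 we have for all $j \in \mathbb{N}$

\[
\frac{2nd}{\sigma^2}\,u_j^2\,v_j \le 1,\qquad
\sum_{j=1}^{m_*}u_j^2\,b_j \le \rho,\qquad\text{and}\qquad
\sum_{j=1}^{m_*}u_j^2\,\omega_j \ge \frac{\delta_*}{\Delta}\,\min\Big\{\frac{\sigma^2}{2d},\frac{\rho}{\Delta}\Big\}.
\tag{A.15}
\]

Proof. The first inequality in (A.15) follows trivially by using the definition of $\zeta$, while the definition of $m_*$ given in (2.4) together with Assumption 2.1, i.e., $(b_j/\omega_j)$ is non-decreasing, implies the second, i.e., $\sum_{j=1}^{m_*}u_j^2b_j \le \zeta\,(b_{m_*}/\omega_{m_*})\sum_{j=1}^{m_*}\omega_j/(nv_j) \le \zeta\Delta \le \rho$. To deduce the third estimate from the definition of $\zeta$ and $\delta_*$ observe that $\sum_{j=1}^{m_*}u_j^2\omega_j = \zeta\,\delta_*\,(b_{m_*}/\omega_{m_*})\sum_{j=1}^{m_*}\omega_j/(nv_j) \ge \delta_*\,\zeta/\Delta$, which proves the lemma. □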

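Lemma A.3. Suppose the sequences $b$, $\omega$ and $v$ satisfy Assumption 2.1. Let $\Gamma \in \mathcal{N}_v^d$. Then

\begin{align*}
\sup_{m\in\mathbb{N}}\big\{v_m\,\|[\Gamma]_m^{-1/2}\|^2\big\} &\le \{2d^2(2d^4+3)\}^{1/2} \le 4d^3, \tag{A.16}\\
\sup_{m\in\mathbb{N}}\big\{\|[\mathrm{Diag}(v)]_m^{1/2}[\Gamma]_m^{-1/2}\|^2\big\} &\le \{2d^2(2d^4+3)\}^{1/2} \le 4d^3, \tag{A.17}\\
\sup_{m\in\mathbb{N}}\big\{\|[\mathrm{Diag}(v)]_m^{-1/2}[\Gamma]_m^{1/2}\|^2\big\} &\le d. \tag{A.18}
\end{align*}

If in addition $\beta^m$ denotes a Galerkin solution of $g = \Gamma\beta$ with $\beta \in \mathcal{W}_b^\rho$ then

\[
\sup_{m\in\mathbb{N}}\big\{(b_m/\omega_m)\,\|\beta - \beta^m\|_\omega^2\big\} \le 2(2d^4+3)\,\rho \le 10\,d^4\rho.
\tag{A.19}
\]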



Proof. We start our proof with the observation that the link condition $\Gamma \in \mathcal{N}_v^d$ implies that $\Gamma$ is strictly positive and that for all $|s| \le 1$, by using the inequality of Heinz (1951),

\[
d^{-2|s|}\,\|f\|_{v^{2s}}^2 \le \|\Gamma^s f\|^2 \le d^{2|s|}\,\|f\|_{v^{2s}}^2.
\tag{A.20}
\]






Consider $g \in \Psi_m$. Then (A.20) implies $\beta := \Gamma^{-1}g \in L^2[0,1]$ by using that $\|g\|_{v^{-2}} = \|[\mathrm{Diag}(v)]_m^{-1}[g]_m\| < \infty$. Furthermore, $\beta^m$ with $[\beta^m]_m = [\Gamma]_m^{-1}[g]_m$ is the unique Galerkin solution of (1.7). By using successively the first inequality of (A.20), the Galerkin condition (1.7) and the second inequality of (A.20), we obtain

\[
\|\beta - \beta^m\|_{v^2}^2 \le d^2\,\|\Gamma(\beta - \beta^m)\|^2 \le d^2\,\|\Gamma(\beta - \Pi_m\beta)\|^2 \le d^4\,\|\beta - \Pi_m\beta\|_{v^2}^2.
\tag{A.21}
\]

Since $(v_j)$ is monotonically decreasing it follows that $\|\beta - \Pi_m\beta\|_{v^2}^2 \le v_m^2\,\|\beta\|^2$ and hence, by using (A.20) with $s = -1$, we have $\|\beta - \Pi_m\beta\|_{v^2}^2 \le d^2v_m^2\,\|g\|_{v^{-2}}^2$. Combining the last estimate with (A.21) we obtain

\[
\|\beta^m - \Pi_m\beta\|_{v^2}^2 \le 2\big\{\|\beta - \beta^m\|_{v^2}^2 + \|\beta - \Pi_m\beta\|_{v^2}^2\big\} \le 2d^2(d^4+1)\,v_m^2\,\|g\|_{v^{-2}}^2,
\]

which together with $\|f\|^2 \le v_m^{-2}\|f\|_{v^2}^2$ for all $f \in \Psi_m$ leads to

\[
\|\beta^m - \Pi_m\beta\|^2 \le v_m^{-2}\,\|\beta^m - \Pi_m\beta\|_{v^2}^2 \le 2d^2(d^4+1)\,\|g\|_{v^{-2}}^2.
\]

By using the last estimate together with $\|g\|_{v^{-2}} = \|[\mathrm{Diag}(v)]_m^{-1}[g]_m\|$ we conclude that

\[
\|[\Gamma]_m^{-1}[g]_m\|^2 = \|\beta^m\|^2 \le 2\big\{\|\beta^m - \Pi_m\beta\|^2 + \|\Pi_m\beta\|^2\big\} \le 2d^2(2d^4+3)\,\|[\mathrm{Diag}(v)]_m^{-1}[g]_m\|^2,\qquad \forall g \in \Psi_m.
\tag{A.22}
\]

Then, from (A.22) follows by using the inequality of Heinz (1951) for all $g \in \Psi_m$

\[
\|[\Gamma]_m^{-1/2}[g]_m\|^2 \le \{2d^2(2d^4+3)\}^{1/2}\,\|[\mathrm{Diag}(v)]_m^{-1/2}[g]_m\|^2,
\]

which implies together with $\|[\mathrm{Diag}(v)]_m^{-1}\| = v_m^{-1}$ the estimate (A.16), and furthermore by replacing $[g]_m$ by $[\mathrm{Diag}(v)]_m^{1/2}[g]_m$ the estimate (A.17), that is,

\[
\|[\Gamma]_m^{-1/2}[\mathrm{Diag}(v)]_m^{1/2}[g]_m\|^2 \le \{2d^2(2d^4+3)\}^{1/2}\,\|[g]_m\|^2,\qquad \forall g \in \Psi_m.
\]

Proof of (A.18). By using the second inequality of (A.20) together with $\|\Pi_m\| = 1$ we obtain

\[
\|[\Gamma]_m[g]_m\|^2 = \|\Pi_m\Gamma g\|^2 \le \|\Gamma g\|^2 \le d^2\,\|g\|_{v^2}^2 = d^2\,\|[\mathrm{Diag}(v)]_m[g]_m\|^2,\qquad \forall g \in \Psi_m,
\]

and hence the inequality of Heinz (1951) implies

\[
\|[\Gamma]_m^{1/2}[g]_m\|^2 \le d\,\|[\mathrm{Diag}(v)]_m^{1/2}[g]_m\|^2,\qquad \forall g \in \Psi_m.
\]

Thereby, (A.18) follows by replacing $[g]_m$ by $[\mathrm{Diag}(v)]_m^{-1/2}[g]_m$, that is,

\[
\|[\Gamma]_m^{1/2}[\mathrm{Diag}(v)]_m^{-1/2}[g]_m\|^2 \le d\,\|[g]_m\|^2,\qquad \forall g \in \Psi_m.
\]

Proof of (A.19). Let $\beta \in \mathcal{W}_b^\rho$. Consider the decomposition

\[
\|\beta - \beta^m\|_\omega^2 \le 2\big\{\|\beta - \Pi_m\beta\|_\omega^2 + \|\Pi_m\beta - \beta^m\|_\omega^2\big\}.
\]

Since $(\omega_j/b_j)$ is non-increasing it follows that $\|\beta - \Pi_m\beta\|_\omega^2 \le (\omega_m/b_m)\,\|\beta\|_b^2$, while we show below

\[
\|\Pi_m\beta - \beta^m\|_\omega^2 \le 2(1 + d^4)\,(\omega_m/b_m)\,\|\beta\|_b^2.
\tag{A.23}
\]

Consequently, by combination of these two bounds the condition $\beta \in \mathcal{W}_b^\rho$, i.e., $\|\beta\|_b^2 \le \rho$, implies (A.19). From (A.21) follows $\|\beta - \beta^m\|_{v^2}^2 \le d^4\,\|\beta - \Pi_m\beta\|_{v^2}^2 \le d^4\,(v_m^2/b_m)\,\|\beta\|_b^2$ because $(v_j^2/b_j)$ is non-increasing, and hence,

\[
\|\Pi_m\beta - \beta^m\|_{v^2}^2 \le 2\big\{\|\beta - \beta^m\|_{v^2}^2 + \|\beta - \Pi_m\beta\|_{v^2}^2\big\} \le 2(1 + d^4)\,(v_m^2/b_m)\,\|\beta\|_b^2.
\tag{A.24}
\]

Furthermore, $\|\Pi_m\beta - \beta^m\|_\omega^2 \le \omega_m v_m^{-2}\,\|\Pi_m\beta - \beta^m\|_{v^2}^2$ since $(\omega_j/v_j^2)$ is non-decreasing. The last estimate and (A.24) together imply (A.23), which completes the proof. □

A.2 Proofs of Section 3



The mean prediction error. 

Proof of Proposition 3.1. Since $\Gamma \in \mathcal{N}_v^d$, $d \ge 1$, it follows by using the inequality of Heinz (1951) that $\mathbb{E}\|\tilde\beta - \beta\|_\Gamma^2 \sim \mathbb{E}\|\tilde\beta - \beta\|_v^2$. Therefore, we can apply the general results by considering the $\mathcal{W}_\omega$-risk with $\omega = v$ as a measure of the performance of an estimator of $\beta$. Furthermore, in case (i) the definitions of $b_j$ and $v_j$ together imply $(b_{m_*}/\omega_{m_*})\sum_{j=1}^{m_*}\omega_j/v_j = m_*^{2a+2p+1}$. It follows that the condition on $m_*$ and $\delta_*$ given in (2.4) of Theorem 2.3 can be rewritten as $m_* \sim n^{1/(2p+2a+1)}$ and $\delta_* \sim n^{-(2p+2a)/(2p+2a+1)}$. On the other hand, in case (ii) $(b_{m_*}/\omega_{m_*})\sum_{j=1}^{m_*}\omega_j/v_j = m_*^{2p+1}\exp(m_*^{2a})$ implies that the condition on $m_*$ and $\delta_*$ reads $m_* \sim (\log n)^{1/(2a)}$ and $\delta_* \sim n^{-1}(\log n)^{1/(2a)}$. Consequently, the lower bounds in Proposition 3.1 follow by applying Theorem 2.3. □
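The balancing argument of case (i) can be checked numerically. The sketch below rests on assumptions that are not restated in this section: condition (2.4) is read as requiring $(b_{m_*}/\omega_{m_*})\sum_{j\le m_*}\omega_j/v_j$ to be of order $n$, with $\delta_* = \omega_{m_*}/b_{m_*}$, and $m_*$ is taken as the smallest $m$ for which this quantity reaches $n$. The weights are $b_j = j^{2p}$ and $\omega_j = v_j$; all function names are hypothetical.

```python
import math


def v(j, a, exponential=False):
    """Weights v_1 = 1 and v_j = j^{-2a} (or exp(-j^{2a})) for j >= 2."""
    if j == 1:
        return 1.0
    return math.exp(-(j ** (2 * a))) if exponential else j ** (-2.0 * a)


def balance(m, p, a, exponential=False):
    """(b_m / omega_m) * sum_{j<=m} omega_j / v_j for omega = v, b_j = j^{2p}."""
    b_m = m ** (2.0 * p)
    sum_ratio = float(m)  # omega_j / v_j = 1 for every j when omega = v
    return b_m / v(m, a, exponential) * sum_ratio


def m_star(n, p, a, exponential=False):
    """Smallest m whose balance term reaches n (assumed reading of (2.4))."""
    m = 1
    while balance(m, p, a, exponential) < n:
        m += 1
    return m


def delta_star(n, p, a, exponential=False):
    """delta_* = omega_{m_*} / b_{m_*} evaluated at m_* (omega = v)."""
    m = m_star(n, p, a, exponential)
    return v(m, a, exponential) / m ** (2.0 * p)


if __name__ == "__main__":
    n, p, a = 128, 1, 2
    print(m_star(n, p, a), n ** (1 / (2 * p + 2 * a + 1)))
    print(delta_star(n, p, a), n ** (-(2 * p + 2 * a) / (2 * p + 2 * a + 1)))
```

With $n = 2^7$, $p = 1$ and $a = 2$ the balance term equals $m^7$, so the sketch returns $m_* = 2$ and $\delta_* = 2^{-6}$, in agreement with the orders $n^{1/(2p+2a+1)}$ and $n^{-(2p+2a)/(2p+2a+1)}$.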

Proof of Proposition 3.2. Note that for sufficiently large $n$ the condition on $\gamma$ in Theorem 2.4 reads $\gamma = n$ because $(b_j)$ is increasing. Furthermore, it is easily seen that the additional condition (2.6) is satisfied in the exponential case and, for all $k \ge 2 + 8/(2p+2a-1)$, also in the polynomial case. Finally, since in both cases the condition on $m$ ensures that $m \sim m_*$ (see the proof of Proposition 3.1), the result follows from Theorem 2.4. □

The estimation of derivatives. 

Proof of Proposition 3.3. Since for each $0 \le s \le p$ we have $\mathbb{E}\|\tilde\beta^{(s)} - \beta^{(s)}\|^2 \sim \mathbb{E}\|\tilde\beta - \beta\|_{b^s}^2$, we can apply again the general results by considering the $\mathcal{W}_\omega$-risk with $\omega = b^s$. In case (i) the well-known approximation $\sum_{j=1}^{m}j^r \sim m^{r+1}$ for $r > 0$ together with the definition of $b_j$ and $v_j$ imply $(b_{m_*}/\omega_{m_*})\sum_{j=1}^{m_*}\omega_j/v_j \sim m_*^{2a+2p+1}$. It follows that the condition on $m_*$ and $\delta_*$ given in (2.4) of Theorem 2.3 reads $m_* \sim n^{1/(2p+2a+1)}$ and $\delta_* \sim n^{-(2p-2s)/(2p+2a+1)}$. On the other hand, in case (ii) by applying Laplace's method (cf. chapter 3.7 in Olver (1974))



the definition of bj and Vj imply (C/^mJEjl'i^iM ~ m* p exp(m 2a ) implies that the 
condition on m* and <5* can be rewritten as to* ~ (logn) 1 ^ 20 - 1 and 5* ~ n~ 1 (logn) 1 /( 2a ). 
Consequently, the lower bounds in Proposition 3.3 follow by applying Theorem 2.3. □



Proof of Proposition 3.4. The proof follows in analogy to the proof of Proposition
and we omit the details. □